Extracting job information from LinkedIn Jobs using BeautifulSoup and Selenium

The goal of this project is to scrape job postings and related information from LinkedIn. We will be scraping information about 'Data Analyst' positions in 'Canada'. After running the code below, we will have the following information about each job posting:

  1. Date
  2. Title
  3. Company Name
  4. Location
  5. Job Description
  6. Job Level
  7. Job Type
  8. Function
  9. Industry
  10. Job ID

You can also use this code for different job titles and locations. To do that, follow the steps below:

  1. Open this link in Chrome incognito mode.
  2. Specify the job title and location in the search bar.
  3. Paste sortBy=DD& after location=(will show your searched location) in the web link.
  4. Copy the final link and replace the url variable in code block 2 with the new URL.
  5. To scrape data for more jobs, set the number of jobs (in multiples of 25, such as 50, 75, or 100) in the variable 'no_of_jobs' in code block 2.
  6. Once all the above steps are done, run the code.

For this example, we will only look at the 25 most recent jobs. I hope you enjoy this.
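The search URL in code block 2 can also be assembled programmatically from its query parameters instead of being copied by hand. A minimal sketch using only the standard library (parameter names and values are taken from that URL; the geoId is the identifier LinkedIn assigns to the searched location):

```python
from urllib.parse import urlencode, quote

# Query parameters taken from the URL in code block 2.
params = {
    'f_TPR': 'r604800',       # restrict to postings from the last 604800 seconds (one week)
    'geoId': '101174742',     # LinkedIn's geo identifier for the searched location (Canada)
    'keywords': 'data analyst',
    'location': 'Canada',
    'sortBy': 'DD',           # sort by date posted, most recent first
}

# quote_via=quote encodes the space in 'data analyst' as %20 rather than '+',
# matching the URL as copied from the browser.
url = 'https://www.linkedin.com/jobs/search/?' + urlencode(params, quote_via=quote)
print(url)
```

Changing 'keywords' and 'location' (and the matching geoId) here is equivalent to steps 1-4 above.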

In [1]:
# importing packages
import pandas as pd
import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
In [2]:
# replace variables here.
url = "https://www.linkedin.com/jobs/search/?f_TPR=r604800&geoId=101174742&keywords=data%20analyst&location=Canada&sortBy=DD"
no_of_jobs = 25
In [3]:
# this will open a new browser window with the url provided above
driver = webdriver.Chrome()
driver.get(url)
sleep(3)
action = ActionChains(driver)
In [4]:
# click "See more jobs" until the requested number of postings is loaded;
# the number of clicks depends on no_of_jobs
for _ in range(no_of_jobs // 25 - 1):
    driver.find_element_by_xpath('/html/body/main/div/section/button').click()
    sleep(5)
In [5]:
# parsing the visible webpage
pageSource = driver.page_source
lxml_soup = BeautifulSoup(pageSource, 'lxml')

# searching for all job containers
job_container = lxml_soup.find('ul', class_ = 'jobs-search__results-list')

print('You are scraping information about {} jobs.'.format(len(job_container)))
You are scraping information about 25 jobs.
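As a side note, `len(job_container)` works here because `find` returns the `<ul>` tag and `len()` on a BeautifulSoup tag counts its direct children. A toy illustration (the markup is invented; only the class name matches the real page):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the LinkedIn results list; only the
# class name is taken from the real page.
html = ('<ul class="jobs-search__results-list">'
        '<li>job 1</li><li>job 2</li><li>job 3</li></ul>')
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('ul', class_='jobs-search__results-list')

# len() on a tag counts its direct children; with no whitespace between
# the <li> tags, these are exactly the job cards.
print(len(container))
print(len(container.find_all('li')))  # equivalent, and robust to whitespace between tags
```

If the page ever contained whitespace text nodes between the `<li>` elements, `len(container.find_all('li'))` would be the safer count.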
In [6]:
# setting up list for job information
job_id = []
post_title = []
company_name = []
post_date = []
job_location = []
job_desc = []
level = []
emp_type = []
functions = []
industries = []

# for loop for job title, company, id, location and date posted
for job in job_container:
    
    # job title
    job_titles = job.find("span", class_="screen-reader-text").text
    post_title.append(job_titles)
    
    # linkedin job id
    job_ids = job.find('a', href=True)['href']
    job_ids = re.findall(r'(?!-)([0-9]*)(?=\?)',job_ids)[0]
    job_id.append(job_ids)
    
    # company name
    company_names = job.select_one('img')['alt']
    company_name.append(company_names)
    
    # job location
    job_locations = job.find("span", class_="job-result-card__location").text
    job_location.append(job_locations)
    
    # posting date
    post_dates = job.select_one('time')['datetime']
    post_date.append(post_dates)

# for loop for job description and criteria
for x in range(1, len(job_id) + 1):
    
    # clicking on each job card to view information about the job
    job_xpath = '/html/body/main/div/section/ul/li[{}]/img'.format(x)
    driver.find_element_by_xpath(job_xpath).click()
    sleep(3)
    
    # job description
    jobdesc_xpath = '/html/body/main/section/div[2]/section[2]/div'
    job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
    job_desc.append(job_descs)
    
    # job criteria list below the description: each <li> holds the label
    # on its first line and the value on its second
    
    # Seniority level
    seniority_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[1]'
    seniority = driver.find_element_by_xpath(seniority_xpath).text.splitlines()[1]
    level.append(seniority)
    
    # Employment type
    type_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[2]'
    employment_type = driver.find_element_by_xpath(type_xpath).text.splitlines()[1]
    emp_type.append(employment_type)
    
    # Job function
    function_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[3]'
    job_function = driver.find_element_by_xpath(function_xpath).text.splitlines()[1]
    functions.append(job_function)
    
    # Industries
    industry_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[4]'
    industry_type = driver.find_element_by_xpath(industry_xpath).text.splitlines()[1]
    industries.append(industry_type)
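The regular expression used for the job ID in the first loop can be checked in isolation. A small sketch on an invented href with the same shape as LinkedIn's job links (the numeric ID sits immediately before the '?' query string):

```python
import re

# Invented href of the same shape as LinkedIn job links; the numeric
# job ID comes right before the '?' that starts the query string.
href = ('https://www.linkedin.com/jobs/view/'
        'data-analyst-at-example-corp-1486231080?refId=abc123')

# Same pattern as in the loop above: a run of digits followed by '?',
# where (?!-) keeps the match from starting on the preceding hyphen.
extracted_id = re.findall(r'(?!-)([0-9]*)(?=\?)', href)[0]
print(extracted_id)
```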
In [7]:
# checking that we collected information for every job
print(len(job_id))
print(len(post_date))
print(len(company_name))
print(len(post_title))
print(len(job_location))
print(len(job_desc))
print(len(level))
print(len(emp_type))
print(len(functions))
print(len(industries))
25
25
25
25
25
25
25
25
25
25
In [8]:
# creating a dataframe
job_data = pd.DataFrame({'Job ID': job_id,
'Date': post_date,
'Company Name': company_name,
'Post': post_title,
'Location': job_location,
'Description': job_desc,
'Level': level,
'Type': emp_type,
'Function': functions,
'Industry': industries
})

# cleaning description column
job_data['Description'] = job_data['Description'].str.replace('\n',' ')

print(job_data.info())
job_data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 10 columns):
Job ID          25 non-null object
Date            25 non-null object
Company Name    25 non-null object
Post            25 non-null object
Location        25 non-null object
Description     25 non-null object
Level           25 non-null object
Type            25 non-null object
Function        25 non-null object
Industry        25 non-null object
dtypes: object(10)
memory usage: 2.0+ KB
None
Out[8]:
Job ID Date Company Name Post Location Description Level Type Function Industry
0 1486231080 2019-09-20 Alberta Health Services Administrative Support III - Technical Validator Edmonton, CA Your Opportunity: Are you are an experienced, ... Entry level Temporary Administrative Nonprofit Organization ManagementHealth, Welln...
1 1486232078 2019-09-20 Alberta Health Services Administrative Support III - Technical Validator Calgary, CA Your Opportunity: Are you are an experienced, ... Entry level Temporary Administrative Nonprofit Organization ManagementHealth, Welln...
2 1460558684 2019-09-20 fin/tech Professionals Inc. Full Stack Developer Toronto, Canada Area The Client specializes in an innovative softwa... Mid-Senior level Full-time Information Technology Information Technology and Services
3 1462375355 2019-09-20 jobleads.com - Careers for Senior-Level Profes... Senior Data Analyst/Advisor Calgary, CA Utilizing knowledge of accounting concepts, GA... Associate Full-time Information Technology Internet
4 1486227566 2019-09-20 Alberta Securities Commission Analyst, Data Analytics & Risk (24 Month Contr... Calgary, Alberta, Canada Analyst, Data Analytics & Risk (24 Month Contr... Associate Contract OtherAccounting/AuditingAnalyst Capital Markets
In [9]:
job_data.to_csv('LinkedIn Job Data.csv', index=False)
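The newline clean-up applied to the Description column in code block 8 can be illustrated on a tiny stand-in DataFrame (values invented):

```python
import pandas as pd

# Invented stand-in data to illustrate the Description clean-up.
df = pd.DataFrame({'Description': ['line one\nline two', 'no newlines here']})

# Replace literal newlines with spaces, as done for the real job data,
# so each description fits on a single CSV row cleanly.
df['Description'] = df['Description'].str.replace('\n', ' ')
print(df['Description'].tolist())
```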

Author: Amandeep Saluja