Web scraping and Beautiful Soup – Ayten Yesim Semchenko, Ph.D.

When you are looking for a course to take, the sheer number of options on the Internet can lead to choice paralysis. To simplify the decision, you can scrape the data, if the website allows it, and compare the options more conveniently.

Let us say that you are searching for online data science courses on Reed (reed.co.uk); the search URL appears in the code below. Before scraping, right-click on the page in your browser and choose Inspect. If it is an HTML web page, you will see how it is structured and where the data that you are interested in is located. From that point on, you can use Python's Beautiful Soup library to extract your data! Let us first extract the course provider names and the course links into one data frame:

#Load the necessary libraries
import json #needed for json.loads below
import pandas as pd
import requests
from bs4 import BeautifulSoup

#Create a list called "full" where you save your data.
page = [1, 2, 3, 4, 5, 6, 7, 8, 9] #let us extract the first 9 pages
full = []
for i in page:
    #Use an f-string so that the page number is inserted into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    #The course list is embedded in the page as a JSON-LD script
    course = soup.find('script', type='application/ld+json')
    provider = [el['provider'] for el in json.loads(course.text)['itemListElement']]
    full.append(provider)
#Flatten the list of per-page lists into one list
full = [l for li in full for l in li]

#Create a data frame with that list
data = pd.DataFrame(full)
data.columns = ['type', 'provider_name', 'course_links']
data.head()

Output: the first five rows of the data frame, with the columns type, provider_name and course_links.
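
To see where the three columns come from, it can help to run the same extraction on a small, made-up page. The sketch below assumes that each entry in itemListElement carries a provider object with @type, name and url keys; the sample HTML, its values and the helper name extract_providers are invented for illustration and are not Reed's actual markup.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: a JSON-LD script holding an ItemList of two courses.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList",
 "itemListElement": [
   {"description": "Intro to data science",
    "provider": {"@type": "Organization", "name": "Provider A",
                 "url": "https://example.com/a"}},
   {"description": "Machine learning basics",
    "provider": {"@type": "Organization", "name": "Provider B",
                 "url": "https://example.com/b"}}
 ]}
</script>
</head><body></body></html>
"""


def extract_providers(html):
    """Return the 'provider' dict of every item in the page's JSON-LD list."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    if tag is None:  # the page has no JSON-LD script
        return []
    payload = json.loads(tag.string)
    return [item["provider"] for item in payload["itemListElement"]]


providers = extract_providers(SAMPLE_HTML)
print(providers[0])
```

Under this assumption each provider dict has three keys, which is why a data frame built from the list has three columns that can then be renamed.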

Let us also extract the course descriptions:

page = [1, 2, 3, 4, 5, 6, 7, 8, 9]
full2 = []
for i in page:
    #Use an f-string so that the page number is inserted into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    course_description = soup.find('script', type='application/ld+json')
    description = [el['description'] for el in json.loads(course_description.text)['itemListElement']]
    full2.append(description)

#Flatten the list of per-page lists into one list
full2 = [l for li in full2 for l in li]

df2 = pd.DataFrame(full2)
df2.columns = ['course_description']
df2.head()

Output: the first five rows of df2, with the single column course_description.
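
The two loops above download the same nine pages twice. One possible variation, sketched here on already-parsed JSON rather than on live pages, collects the provider fields and the description in a single pass. The function name rows_from_payload and the name and url keys inside provider are assumptions, not something the code above confirms.

```python
def rows_from_payload(payload):
    """Turn one parsed JSON-LD payload into a list of flat row dicts."""
    rows = []
    for item in payload.get("itemListElement", []):
        provider = item.get("provider", {})
        rows.append({
            "provider_name": provider.get("name"),    # assumed key
            "course_links": provider.get("url"),      # assumed key
            "course_description": item.get("description"),
        })
    return rows


# Made-up payload standing in for json.loads(...) on one results page.
sample_payload = {
    "itemListElement": [
        {"description": "Intro to data science",
         "provider": {"@type": "Organization", "name": "Provider A",
                      "url": "https://example.com/a"}},
    ]
}

rows = rows_from_payload(sample_payload)
print(rows)
```

Passing the collected rows to pd.DataFrame would then give one data frame with all three columns, so the provider and description tables do not have to be built separately.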

We can do the same thing for job advertising websites, as long as it is allowed :). Doing so can make a job search easier and help you understand what most employers want. Overall, web scraping is a skill that you will not regret having!
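
One common, though not sufficient, way to check whether scraping is allowed is to read the site's robots.txt file; the site's terms of use still apply. The sketch below uses urllib.robotparser from Python's standard library on an invented robots.txt, so the rules, the example.com URLs, the helper name is_allowed and the user agent name my-course-bot are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /courses/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


courses_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-course-bot",
                        "https://example.com/courses/data-science")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-course-bot",
                        "https://example.com/private/reports")
print(courses_ok, private_ok)
```

For a live site, the parser's set_url and read methods can load the real file instead of a hard-coded string.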

Cheers!
