Ayten Yesim Semchenko, Ph.D.

Behavioral Statistician/Researcher


Web scraping and Beautiful Soup

Posted on October 29, 2022 (updated November 18, 2022) by Yesim Semchenko

Let us say that you want to take an online course, but the sheer number of courses on the Internet leaves you confused. Instead of clicking on every single course and going back and forth to compare them, you can scrape the data, if the website allows it, and make your comparisons more conveniently.
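
One rough way to check the "if the website allows it" part is to look at the site's robots.txt file. Below is a minimal sketch using Python's built-in urllib.robotparser. The rules and the example.com URLs are made up for illustration, and the helper name is_allowed is hypothetical. Keep in mind that robots.txt is only one signal, and a website's terms of use may say more.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, used only for illustration (not taken from any real site)
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse the rules from memory, no network request
    return parser.can_fetch(user_agent, url)

courses_ok = is_allowed(SAMPLE_RULES, "https://example.com/courses/data-science")
private_ok = is_allowed(SAMPLE_RULES, "https://example.com/private/page")
print(courses_ok, private_ok)
```

For a real website, you could call set_url() with the address of its robots.txt and then read() on the parser, instead of passing the lines yourself.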

Let us say that you are searching for online data science courses on Reed, using the search page at https://www.reed.co.uk/courses/data-science. Before any web scraping, right-click on the page and choose Inspect. If it is an HTML web page, you will see how it is structured and where the data that you are interested in is located. From that point on, you can use Python's Beautiful Soup library to extract your data! Let us first extract the course provider names and the course links into one data frame:

#Load the necessary libraries
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup

#Create a list called "full" where you save your data.
page = [1, 2, 3, 4, 5, 6, 7, 8, 9] #let us extract the first 9 pages
full = []
for i in page:
    #Use an f-string so that the page number is actually inserted into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    #The course data sits in a JSON-LD script tag
    course = soup.find('script', type='application/ld+json')
    provider = [el['provider'] for el in json.loads(course.text)['itemListElement']]
    full.append(provider)
#Flatten the list of per-page lists into one list
full = [l for li in full for l in li]

#Create a data frame with that list
data = pd.DataFrame(full)
data.columns = ['type', 'provider_name', 'course_links']
data.head()
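
To see what the loop above relies on, here is a small offline sketch that parses a JSON-LD script tag from a made-up HTML string. The sample page, the course names, and the helper name extract_providers are all invented for illustration. The three keys inside each provider entry are an assumption based on the three column names used above, and the real page may be structured differently.

```python
import json
from bs4 import BeautifulSoup

# Made-up stand-in for a search results page (not real Reed data)
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList",
 "itemListElement": [
   {"@type": "Course", "description": "A first course.",
    "provider": {"@type": "Organization", "name": "Example Academy",
                 "url": "https://example.com/intro"}},
   {"@type": "Course", "description": "Means and variances.",
    "provider": {"@type": "Organization", "name": "Sample School",
                 "url": "https://example.com/stats"}}
 ]}
</script>
</head><body></body></html>
"""

def extract_providers(html):
    """Return the 'provider' entry of every item in the page's JSON-LD block."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    if tag is None:  # the page has no JSON-LD block
        return []
    payload = json.loads(tag.string)
    return [item["provider"] for item in payload.get("itemListElement", [])]

providers = extract_providers(SAMPLE_HTML)
print(providers)
```

Returning an empty list when the script tag is missing keeps a multi-page loop from crashing on a page that has no results.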

Let us also extract the course descriptions:

page = [1, 2, 3, 4, 5, 6, 7, 8, 9]
full2 = []
for i in page:
    #Use an f-string so that the page number is actually inserted into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    course_description = soup.find('script', type='application/ld+json')
    description = [el['description'] for el in json.loads(course_description.text)['itemListElement']]
    full2.append(description)

#Flatten the list of per-page lists into one list
full2 = [l for li in full2 for l in li]

df2 = pd.DataFrame(full2)
df2.columns = ['course_description']
df2.head()
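
Since both loops read the same JSON-LD block, the provider and the description could also be collected in one pass, which keeps them aligned row by row for comparison. Here is a minimal sketch of that idea. The function name rows_from_payload and the sample data are hypothetical, and the keys name and url inside provider are assumptions about the page's structure.

```python
import json

def rows_from_payload(payload):
    """Build one row per course from an already parsed JSON-LD dictionary."""
    rows = []
    for item in payload.get("itemListElement", []):
        provider = item.get("provider") or {}
        rows.append({
            "provider_name": provider.get("name"),
            "course_link": provider.get("url"),
            # .get() returns None instead of raising an error for a missing key
            "course_description": item.get("description"),
        })
    return rows

# Made-up sample data, used only for illustration
sample = json.loads("""
{"itemListElement": [
  {"description": "A first course.",
   "provider": {"name": "Example Academy", "url": "https://example.com/intro"}},
  {"provider": {"name": "Sample School", "url": "https://example.com/stats"}}
]}
""")

rows = rows_from_payload(sample)
print(rows)
```

A list of dictionaries like this can be passed directly to pd.DataFrame(rows) to get a single table with all three columns.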

We can do the same thing for job advertising websites, as long as it is allowed :), which can make a job search easier and help you understand what most employers want. Overall, it is a skill that you will not regret having!

Cheers!
