Ayten Yesim Semchenko, Ph.D.

Researcher

Web scraping and Beautiful Soup

Posted on October 29, 2022 (updated July 14, 2025) by Yesim Semchenko

With so many options on the Internet, searching for a course to take can lead to choice paralysis. To simplify your decision, you can scrape the data, if the site allows it, and compare the options more conveniently.
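
One way to check the "if allowed" part is to read the site's robots.txt file before scraping. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt; the rules, the user agent name my-course-bot and the example.com URLs are assumptions for illustration and are not taken from any real site. A robots.txt file is only part of the picture, since the site's terms of use matter as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /courses/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-course-bot" is an invented user agent name.
courses_ok = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-course-bot", "https://example.com/courses/data-science"
)
account_ok = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-course-bot", "https://example.com/account/settings"
)
print(courses_ok, account_ok)
```

For a live site, RobotFileParser also offers set_url() and read() to download the file instead of passing the text in by hand.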

Let us say that you are searching for online data science courses on Reed (reed.co.uk); the search URL appears in the code below. Before any web scraping, right-click on the page and choose Inspect. If it is an HTML web page, you will see how it is structured and where the data you are interested in are located. From that point on, you can use Python's Beautiful Soup library to extract your data! Let us first extract the course provider names and the course links into one data frame:

#Load the necessary libraries
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup

#Create a list called "full" where you save your data.
page = [1, 2, 3, 4, 5, 6, 7, 8, 9] #let us extract the first 9 pages
full = []
for i in page:
    #Insert the current page number into the search URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    #The course data sit in a JSON-LD script tag
    course = soup.find('script', type='application/ld+json')
    provider = [el['provider'] for el in json.loads(course.text)['itemListElement']]
    full.append(provider)
#Flatten the list of per-page lists into one list
full = [l for li in full for l in li]

#Create a data frame with that list
data = pd.DataFrame(full)
data.columns = ['type', 'provider_name', 'course_links']
data.head()

Output: the first five rows of the data frame, with the columns type, provider_name and course_links.
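
To see why the list comprehension in the loop works, it helps to look at the shape of the JSON-LD block that soup.find returns. The sketch below parses a small hand-written payload. Its values are invented, and its nesting only mimics what the code above relies on: an itemListElement list whose items each carry a provider object. The real page may contain more fields or a different layout, so treat this as an assumption to verify with Inspect.

```python
import json

# Hand-written stand-in for the text of <script type="application/ld+json">.
# All values are made up; only the nesting mirrors what the scraper expects.
SAMPLE_JSON_LD = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "Course", "name": "Course A",
     "description": "Intro to data science",
     "provider": {"@type": "Organization", "name": "Provider One",
                  "url": "https://example.com/a"}},
    {"@type": "Course", "name": "Course B",
     "description": "Machine learning basics",
     "provider": {"@type": "Organization", "name": "Provider Two",
                  "url": "https://example.com/b"}}
  ]
}
"""

payload = json.loads(SAMPLE_JSON_LD)

# Same comprehension as in the scraper: one provider dict per course.
providers = [el["provider"] for el in payload["itemListElement"]]

# In this sample each provider dict has three keys, so a data frame
# built from the list would have three columns.
for provider in providers:
    print(list(provider.keys()), provider["name"])
```

Swapping el["provider"] for el["description"] pulls out the description strings in the same way.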

Let us also extract the course descriptions:

import json #needed for json.loads below

page = [1, 2, 3, 4, 5, 6, 7, 8, 9]
full2 = []
for i in page:
    #Insert the current page number into the search URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    course_description = soup.find('script', type='application/ld+json')
    description = [el['description'] for el in json.loads(course_description.text)['itemListElement']]
    full2.append(description)

#Flatten the list of per-page lists into one list
full2 = [l for li in full2 for l in li]

df2 = pd.DataFrame(full2)
df2.columns = ['course_description']
df2.head()

Output: the first five rows of df2, with the single column course_description.
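
Once the lists are built, you may want to keep them on disk so that you do not have to scrape the pages again. A minimal sketch with the standard csv module follows. The rows, the column names and the file name courses.csv are placeholders, and pandas users could reach the same result with DataFrame.to_csv.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the scraped provider and description data.
rows = [
    {"provider_name": "Provider One", "course_description": "Intro to data science"},
    {"provider_name": "Provider Two", "course_description": "Machine learning, with examples"},
]


def save_rows(rows, path):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV file back into a list of dicts (all values come back as strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "courses.csv")
    save_rows(rows, csv_path)
    restored = load_rows(csv_path)

print(restored == rows)
```

The csv module takes care of quoting, so a description that contains a comma still ends up in a single column.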

We can do the same for job advertising websites, so long as it is allowed :), and doing so can make a job search easier and help you understand what most employers want. Overall, it is a skill that you will not regret having!
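
As a small illustration of finding out what employers ask for, the sketch below counts how many job descriptions mention each skill. The descriptions and the skill list are invented; in practice the texts would come from a scrape like the one above, where permitted.

```python
import re
from collections import Counter

# Invented job descriptions standing in for scraped text.
job_descriptions = [
    "We need Python and SQL skills, plus experience with pandas.",
    "Looking for an analyst with SQL and Excel.",
    "Python developer wanted; knowledge of SQL is a bonus.",
]

# Hypothetical list of skills to look for (lowercase, single words).
skills = ["python", "sql", "excel", "pandas"]


def count_skill_mentions(descriptions, skills):
    """Count, for each skill, the number of descriptions that mention it."""
    counts = Counter()
    for text in descriptions:
        # Split the lowercased text into words so that "sql," matches "sql".
        words = set(re.findall(r"[a-z+#]+", text.lower()))
        for skill in skills:
            if skill in words:
                counts[skill] += 1
    return counts


skill_counts = count_skill_mentions(job_descriptions, skills)
for skill, n in skill_counts.most_common():
    print(skill, n)
```

Each description is counted at most once per skill, so the numbers read as "how many adverts ask for this" rather than raw word frequencies.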

Cheers!
