What Is Web Scraping?

Web scraping is the process of programmatically extracting data from websites. Instead of copying and pasting content manually, you write code to fetch a web page and pull out the information you need. Python is one of the most popular languages for this task thanks to its simple syntax and powerful libraries.

Important note: Always check a website's robots.txt file and Terms of Service before scraping. Only scrape sites you have permission to access, and don't overload servers with rapid requests.
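The robots.txt check can be automated with the standard library's urllib.robotparser module. The sketch below parses a made-up set of rules so that it runs offline; the rules, the example.com URLs, and the "my-scraper" user-agent name are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used here so the example runs offline.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

USER_AGENT = "my-scraper"  # hypothetical user-agent name

parser = RobotFileParser()
parser.parse(SAMPLE_RULES)
# For a live site, fetch the real file instead of calling parse():
#   parser.set_url("https://example.com/robots.txt")
#   parser.read()


def is_allowed(url: str) -> bool:
    """Return True if the parsed rules let USER_AGENT fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.com/catalogue/page-1.html"))  # True
print(is_allowed("https://example.com/private/report.html"))    # False
print(parser.crawl_delay(USER_AGENT))                           # 2
```

A robots.txt file only covers crawler rules; the site's Terms of Service still need to be read separately.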

What You'll Need

For this tutorial, you'll use two Python libraries:

  • Requests — to fetch the HTML content of a web page
  • BeautifulSoup4 — to parse and navigate the HTML

Install them with pip:

pip install requests beautifulsoup4

Step 1: Fetch a Web Page

Start by fetching the raw HTML of a page using the requests library:

import requests

url = 'https://books.toscrape.com/'
response = requests.get(url, timeout=10)  # timeout stops the call from hanging indefinitely

print(response.status_code)  # 200 means success
print(response.text[:500])   # Preview the first 500 characters
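Requests can fail (timeouts, DNS errors, 404 pages), so it can help to wrap the fetch in a small helper. The following is a minimal sketch, not the only way to do it: fetch_html and its session parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    `session` is any object with a requests-style get() method
    (for example a requests.Session); it defaults to the requests module.
    """
    client = session or requests
    try:
        response = client.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    except requests.RequestException as error:
        print(f"Request for {url} failed: {error}")
        return None
    return response.text
```

Calling fetch_html('https://books.toscrape.com/') would return the page's HTML as a string, or None if anything went wrong along the way.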

We're using books.toscrape.com, a sandbox site built specifically for practicing web scraping, so you can experiment on it freely.

Step 2: Parse the HTML with BeautifulSoup

Once you have the raw HTML, pass it to BeautifulSoup for parsing:

from bs4 import BeautifulSoup

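# Passing the raw bytes (response.content) lets BeautifulSoup detect the page's
# encoding itself, which avoids garbled characters such as a mis-decoded £ sign.
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text.strip())  # Prints the page title without surrounding whitespace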

The 'html.parser' argument tells BeautifulSoup which parser to use. It's built into Python's standard library, so no extra install is needed.
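One detail worth knowing when handing HTML to the parser: if the text was decoded with the wrong character set, symbols such as £ come out garbled. When a server sends a text response without declaring a charset, requests falls back to ISO-8859-1 for response.text, whereas passing bytes lets BeautifulSoup read the encoding declared in the document. The sketch below reproduces the effect offline with a hypothetical snippet; the markup and price are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes: UTF-8 encoded, with the charset declared in a <meta> tag.
raw_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p class="price_color">£10.00</p></body></html>'
).encode("utf-8")

# Decoding with the wrong charset before parsing garbles the pound sign.
wrong_text = raw_bytes.decode("iso-8859-1")
garbled = BeautifulSoup(wrong_text, "html.parser").p.text

# Passing bytes lets BeautifulSoup pick up the declared encoding.
soup = BeautifulSoup(raw_bytes, "html.parser")
correct = soup.p.text

print(repr(garbled))  # 'Â£10.00'
print(repr(correct))  # '£10.00'
```

With a live response, the equivalent choice is between BeautifulSoup(response.text, ...) and BeautifulSoup(response.content, ...).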

Step 3: Find Elements Using CSS Selectors

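BeautifulSoup lets you find elements by tag name, by class name, or with CSS selectors. Inspect the page in your browser's DevTools to identify the right ones. The code below uses find_all() with a tag name and a class; select('article.product_pod') would match the same elements using CSS selector syntax.

# Find every book container on the page, then print each title and price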
books = soup.find_all('article', class_='product_pod')

for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    print(f"{title} — {price}")
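The same extraction can be written with CSS selectors through select() and select_one(). This is a sketch that runs against a hypothetical snippet modelled on the product_pod structure above, so it works without a network connection; the book titles and prices are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the product_pod structure used above.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="book-one.html" title="Sample Book One">Sample Book ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="book-two.html" title="Sample Book Two">Sample Book ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

results = []
# select() returns every element matching the CSS selector.
for book in soup.select("article.product_pod"):
    # select_one() returns the first match inside this article, or None.
    link = book.select_one("h3 > a")
    price = book.select_one("p.price_color")
    results.append((link["title"], price.get_text(strip=True)))

print(results)
```

Selectors are convenient when the path to an element involves several levels, since "h3 > a" expresses in one string what would otherwise take chained attribute lookups.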

Step 4: Navigate Multiple Pages

Real scraping often requires visiting multiple pages. Here's how to loop through paginated results:

import time

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

for page_num in range(1, 6):  # Scrape first 5 pages
    url = base_url.format(page_num)
    response = requests.get(url, timeout=10)
    time.sleep(1)  # Pause between requests so the server isn't overloaded
    soup = BeautifulSoup(response.content, 'html.parser')

    books = soup.find_all('article', class_='product_pod')
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        print(f"{title} — {price}")
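If you want to keep the results rather than print them, one option is to separate parsing from fetching and accumulate rows in a list. The sketch below does that under a few assumptions: parse_books, scrape_pages, and the fetch_page callable are hypothetical names introduced here, and fetch_page is expected to return a page's HTML or None when the page is unavailable.

```python
import time

from bs4 import BeautifulSoup


def parse_books(html):
    """Extract (title, price) pairs from one catalogue page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        title = book.h3.a["title"]
        price = book.find("p", class_="price_color").text
        rows.append((title, price))
    return rows


def scrape_pages(fetch_page, page_numbers, delay_seconds=1.0):
    """Collect rows from several pages, pausing after each request.

    `fetch_page` is a hypothetical callable that takes a page number and
    returns that page's HTML, or None when the page is unavailable.
    """
    all_rows = []
    for page_num in page_numbers:
        html = fetch_page(page_num)
        if html is None:
            break  # stop at the first missing page
        all_rows.extend(parse_books(html))
        time.sleep(delay_seconds)
    return all_rows
```

With requests, fetch_page could be a small function that formats base_url with the page number, calls requests.get(), and returns response.content when the status code is 200 and None otherwise. Keeping the network call behind a callable also makes the parsing logic easy to test with canned HTML.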

Step 5: Save the Data to a CSV

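Extracted data is only useful if you can store it. Use Python's built-in csv module. Note that after the loop in Step 4, books holds only the articles from the last page parsed, so this snippet writes that one page; to save every page, append each row to a list inside the pagination loop and write that list instead.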

import csv

with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price'])
    
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        writer.writerow([title, price])
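Prices scraped this way are strings such as '£51.77', which are awkward to sort or sum. A possible refinement, sketched below, converts them to numbers before writing; parse_price and save_books are hypothetical helper names, and the code assumes every price is a pound sign followed by a decimal number.

```python
import csv


def parse_price(price_text):
    """Convert a price string such as '£51.77' to a float."""
    return float(price_text.strip().lstrip("£"))


def save_books(rows, path):
    """Write (title, price_text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Price"])
        for title, price_text in rows:
            writer.writerow([title, parse_price(price_text)])
```

For example, save_books([('Some Title', '£51.77')], 'books.csv') would write a header row followed by one data row. The csv module quotes titles that contain commas automatically. For exact currency arithmetic, decimal.Decimal would be a safer choice than float.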

Key BeautifulSoup Methods

Method or attribute       What It Does
find('tag')               Returns the first matching element, or None if nothing matches
find_all('tag')           Returns a list of all matching elements
select('css selector')    Returns a list of elements matching a CSS selector
element.text              Gets the text of the element and all its descendants
element['attribute']      Gets an attribute value (e.g., href, src)
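The entries above behave differently when nothing matches, which is a common source of errors in scrapers. The short sketch below exercises each one on a hypothetical snippet (the markup is invented for illustration) and shows the None and .get() cases.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = '<div><a class="next" href="page-2.html">next</a><p>No <b>bold</b> claims</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")              # first <a>, or None if absent
all_paragraphs = soup.find_all("p")      # list of every <p>
next_links = soup.select("a.next")       # CSS selector syntax
paragraph_text = all_paragraphs[0].text  # text of <p> including its children
href = first_link["href"]                # raises KeyError if the attribute is missing
missing_attr = first_link.get("title")   # .get() returns None instead of raising
missing_tag = soup.find("table")         # None: the snippet has no <table>

print(paragraph_text)  # No bold claims
print(href)            # page-2.html
```

Checking for None before chaining further lookups keeps a scraper from crashing when a page's layout differs from what you expected.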

Next Steps

Once you're comfortable with BeautifulSoup, explore:

  • Scrapy — a full-featured scraping framework for larger projects
  • Playwright or Selenium — for scraping JavaScript-rendered pages
  • Rate limiting — add time.sleep() between requests to avoid overwhelming servers
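For the rate-limiting item, a fixed time.sleep() is often enough, but a small helper can avoid sleeping longer than necessary when parsing already took some time. The class below is one possible sketch using only the standard library; Throttle is a hypothetical name, and the clock and sleep parameters exist only so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous wait(), if any

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

In a scraping loop you would create one instance, for example throttle = Throttle(1.0), and call throttle.wait() immediately before each request.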

Web scraping is a powerful skill for data collection, price monitoring, research automation, and more. Build responsibly!