What Is Web Scraping?
Web scraping is the process of programmatically extracting data from websites. Instead of copying and pasting content manually, you write code to fetch a web page and pull out the information you need. Python is one of the most popular languages for this task thanks to its simple syntax and powerful libraries.
Important note: Always check a website's robots.txt file and Terms of Service before scraping. Only scrape sites you have permission to access, and don't overload servers with rapid requests.
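One way to make the robots.txt check concrete is Python's standard-library urllib.robotparser module. The sketch below is a minimal illustration, not a complete compliance check: the robots.txt rules are invented and inlined so the example runs offline, and the user-agent name my-tutorial-bot is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "my-tutorial-bot"  # placeholder name, not a real crawler


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text (build_parser is a hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html"))  # True
print(parser.can_fetch(USER_AGENT, "https://example.com/private/data.html"))      # False
```

Against a live site you would instead call parser.set_url() with the site's robots.txt URL, followed by parser.read(). The Terms of Service still need to be read by a person.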
What You'll Need
For this tutorial, you'll use two Python libraries:
- Requests — to fetch the HTML content of a web page
- BeautifulSoup4 — to parse and navigate the HTML
Install them with pip:
pip install requests beautifulsoup4
Step 1: Fetch a Web Page
Start by fetching the raw HTML of a page using the requests library:
import requests
url = 'https://books.toscrape.com/'
response = requests.get(url)
print(response.status_code) # 200 means success
print(response.text[:500]) # Preview the first 500 characters
We're using books.toscrape.com, a sandbox site built specifically for practicing web scraping, so you can scrape it freely.
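Because a request can hang or come back with an error page, a slightly more defensive fetch may be useful. This is only a sketch: the helper name fetch_html and the 10-second DEFAULT_TIMEOUT are illustrative choices, not part of the tutorial's own code.

```python
import requests

DEFAULT_TIMEOUT = 10  # seconds; an assumed value, tune it for your use case


def fetch_html(url: str, timeout: float = DEFAULT_TIMEOUT) -> str:
    """Return the page body as text, raising an exception on failure.

    fetch_html is a hypothetical helper name used for illustration.
    """
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://books.toscrape.com/') should return the same HTML as response.text in the example above. Connection problems and timeouts surface as subclasses of requests.RequestException, which you can catch in one place.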
Step 2: Parse the HTML with BeautifulSoup
Once you have the raw HTML, pass it to BeautifulSoup for parsing:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text) # Prints the page title
The 'html.parser' argument tells BeautifulSoup which parser to use. It's built into Python's standard library, so no extra install is needed.
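Note that soup.title is None when a page has no title tag, so soup.title.text would raise an AttributeError on such a page. A small guard avoids that. The sketch below uses a hypothetical helper named page_title and two made-up HTML snippets.

```python
from bs4 import BeautifulSoup


def page_title(html: str, default: str = "(no title)") -> str:
    """Return the page title, or a default when the tag is missing.

    page_title is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return default
    return soup.title.get_text(strip=True)


print(page_title("<html><head><title>Demo Page</title></head></html>"))  # Demo Page
print(page_title("<p>No title here</p>"))                                # (no title)
```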
Step 3: Find Elements Using CSS Selectors
BeautifulSoup lets you find elements using tag names, class names, and CSS selectors. Inspect the page in your browser's DevTools to identify the right selectors. The example below matches by tag name and class using find_all().
# Find all book titles on the page
books = soup.find_all('article', class_='product_pod')
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    print(f"{title} — {price}")
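The same lookup can also be written with CSS selectors through select() and select_one(). The sketch below runs on made-up sample markup, modeled on the tag and class names the code above relies on. The book titles and prices are invented, and extract_books is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mirrors the tags and classes used above.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="book-1.html" title="Sample Book One">Sample Book ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="book-2.html" title="Sample Book Two">Sample Book ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""


def extract_books(soup: BeautifulSoup) -> list[tuple[str, str]]:
    """Return (title, price) pairs found with CSS selectors."""
    results = []
    for book in soup.select("article.product_pod"):
        # select_one returns the first match for the selector, or None.
        title = book.select_one("h3 a")["title"]
        price = book.select_one("p.price_color").get_text(strip=True)
        results.append((title, price))
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for title, price in extract_books(soup):
    print(f"{title}: {price}")
```

Selectors such as "article.product_pod" can be copied almost directly from DevTools, which is often the main reason to prefer select() over find_all().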
Step 4: Navigate Multiple Pages
Real scraping often requires visiting multiple pages. Here's how to loop through paginated results:
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
for page_num in range(1, 6):  # Scrape first 5 pages
    url = base_url.format(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        print(f"{title} — {price}")
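Hard-coding range(1, 6) works when you already know the page count. A different technique, swapped in here as an alternative to the loop above, is to follow each page's "next" link until there is none. The sketch assumes the pager is marked up as an li element with class next containing a link. That is an assumption about the site's HTML, so confirm it in DevTools first. next_page_url is a hypothetical helper, and the two HTML snippets are invented.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")  # assumed pager markup
    if link is None or not link.get("href"):
        return None
    # Pager links are often relative, so resolve them against the current URL.
    return urljoin(current_url, link["href"])


# Invented snippets standing in for a middle page and a final page.
PAGE_WITH_NEXT = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
LAST_PAGE = '<ul class="pager"><li class="previous"><a href="page-1.html">previous</a></li></ul>'

current = "https://books.toscrape.com/catalogue/page-1.html"
print(next_page_url(PAGE_WITH_NEXT, current))  # https://books.toscrape.com/catalogue/page-2.html
print(next_page_url(LAST_PAGE, current))       # None
```

In a real crawl you would fetch a URL, extract the books, call next_page_url on the same HTML, and stop once it returns None.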
Step 5: Save the Data to a CSV
Extracted data is only useful if you can store it. Use Python's built-in csv module:
import csv
# Note: books holds only the most recently parsed page
# (the last page, if you ran the Step 4 loop).
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price'])  # Header row
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        writer.writerow([title, price])
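To save rows gathered from several pages, one option is to collect them in a list during the loop and write them once at the end. The sketch below assumes that approach. write_books_csv is a hypothetical helper, the rows are made up, and it writes to an in-memory buffer so the example runs without touching the disk.

```python
import csv
import io
from typing import Iterable, TextIO


def write_books_csv(rows: Iterable[tuple[str, str]], file: TextIO) -> int:
    """Write a header plus one line per (title, price) row; return the row count."""
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])
    count = 0
    for title, price in rows:
        writer.writerow([title, price])
        count += 1
    return count


# Made-up rows standing in for data accumulated across pages.
all_rows = [
    ("Sample Book One", "£10.00"),
    ("Sample Book, Two", "£12.50"),  # the comma shows csv quoting at work
]

buffer = io.StringIO()
written = write_books_csv(all_rows, buffer)
print(written)  # 2
print(buffer.getvalue())
```

To write to disk instead, pass the file object from open('books.csv', 'w', newline='', encoding='utf-8') in place of the buffer.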
Key BeautifulSoup Methods
| Method | What It Does |
|---|---|
| find('tag') | Returns the first matching element, or None if nothing matches |
| find_all('tag') | Returns a list of all matching elements |
| select('css selector') | Returns a list of elements matching a CSS selector |
| element.text | Gets the element's text content, including text from nested tags |
| element['attribute'] | Gets an attribute value (e.g., href, src) |
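To see the methods in the table side by side, here is a short sketch that applies each one to a tiny, invented HTML snippet. The element names and URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup, small enough to check each result by eye.
SAMPLE_HTML = """
<div id="shelf">
  <a class="book" href="/a.html">Alpha</a>
  <a class="book" href="/b.html">Beta</a>
  <img src="/cover.png">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_link = soup.find("a")                  # first match, or None
all_links = soup.find_all("a")               # every match, in document order
css_links = soup.select("div#shelf a.book")  # CSS selector syntax

print(first_link.text)                       # Alpha
print([link["href"] for link in all_links])  # ['/a.html', '/b.html']
print(len(css_links))                        # 2
print(soup.find("img")["src"])               # /cover.png
print(first_link.get("title"))               # None
```

Indexing with element['attribute'] raises a KeyError when the attribute is missing, so element.get('attribute') is the safer choice when an attribute is optional.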
Next Steps
Once you're comfortable with BeautifulSoup, explore:
- Scrapy — a full-featured scraping framework for larger projects
- Playwright or Selenium — for scraping JavaScript-rendered pages
- Rate limiting — add time.sleep() between requests to avoid overwhelming servers
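The rate-limiting suggestion can be sketched as a small wrapper around whatever function does the fetching. polite_fetch_all is a hypothetical helper and the one-second default delay is an assumed value. The fetch function is passed in so the example can run offline with a stand-in. In practice you would pass a function that calls requests.get.

```python
import time
from typing import Callable, Iterable


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,  # assumed default, adjust per site
) -> list[str]:
    """Fetch each URL in turn, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause between requests, but not before the first one.
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


def fake_fetch(url: str) -> str:
    """Stand-in for a real network fetch so the example runs offline."""
    return f"<html>{url}</html>"


urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]
start = time.monotonic()
pages = polite_fetch_all(urls, fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start
print(len(pages))  # 3
```

With three URLs the wrapper sleeps twice, so the run takes roughly twice delay_seconds plus the fetch time.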
Web scraping is a powerful skill for data collection, price monitoring, research automation, and more. Build responsibly!