
HTML Parsing

After retrieving a web page, you need to extract the specific data you want from the HTML. This is called parsing. There are three main approaches, and the right choice depends on the complexity of your extraction.

Interactive lesson

This topic is covered in depth in the Scrapfly Academy: HTML Parsing lesson with live examples.

Three Approaches

1. CSS Selectors

The most common and recommended approach. CSS selectors use the same syntax as CSS stylesheets to target HTML elements.

```python
from parsel import Selector

html = """
<div class="product">
    <h3><a href="/item/1">Health Potion</a></h3>
    <span class="price">$10</span>
</div>
"""
sel = Selector(html)

# Select by class
price = sel.css(".price::text").get()  # "$10"

# Select by element and attribute
link = sel.css("h3 a::attr(href)").get()  # "/item/1"

# Select multiple elements
all_prices = sel.css(".price::text").getall()  # ["$10"]
```

CSS selectors are easy to learn and cover 90% of scraping needs. When you inspect an element in browser DevTools, you can right-click and "Copy selector" to get the CSS selector for it.

2. XPath

A more powerful query language that supports conditions, text functions, and ancestor/descendant traversal, none of which CSS selectors can express.

```python
from parsel import Selector

sel = Selector(html)

# XPath equivalent of the CSS selectors above
price = sel.xpath("//span[@class='price']/text()").get()  # "$10"

# XPath can do things CSS cannot:

# Select elements containing specific text
potions = sel.xpath("//a[contains(text(), 'Potion')]/@href").getall()

# Select the parent of an element
parent = sel.xpath("//span[@class='price']/..").get()

# Select by position
first_product = sel.xpath("(//div[@class='product'])[1]").get()
```

Use XPath when CSS selectors are not expressive enough, for example when you need to select based on text content or navigate to parent elements.

3. Object-Based Parsing

Libraries like BeautifulSoup turn HTML into a tree of Python objects you can navigate programmatically. This is useful for algorithmic traversal but is slower than selector-based approaches.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Navigate the tree
product = soup.find("div", class_="product")
title = product.find("h3").get_text()
price = product.find("span", class_="price").string
```

Which Approach to Use

  1. Start with CSS selectors - simple, fast, covers most cases
  2. Use XPath as a fallback - for complex selections CSS cannot handle
  3. Use object parsers for algorithms - when you need programmatic traversal

Libraries by Language

| Language   | CSS Selectors         | XPath        | Object-Based    |
|------------|-----------------------|--------------|-----------------|
| Python     | parsel, beautifulsoup | parsel, lxml | beautifulsoup   |
| JavaScript | cheerio               | -            | jsdom           |
| Go         | goquery               | htmlquery    | -               |
| PHP        | domcrawler            | domcrawler   | simple-html-dom |
| Ruby       | nokogiri              | nokogiri     | -               |

For speed-sensitive Python applications, selectolax can be up to 30x faster than BeautifulSoup, and the lxml parser is significantly faster than the built-in html.parser.

Common Patterns

Extracting a List of Items

Most scraping involves finding a list of items and extracting fields from each:

```python
from parsel import Selector

sel = Selector(response.text)

for product in sel.css(".product-card"):
    item = {
        "name": product.css("h3::text").get(),
        "price": product.css(".price::text").get(),
        "url": product.css("a::attr(href)").get(),
        "image": product.css("img::attr(src)").get(),
    }
    print(item)
```

Handling Missing Data

Not every element will always be present. Use defaults to avoid errors:

```python
price = product.css(".price::text").get(default="N/A")
rating = product.css(".rating::attr(data-score)").get(default="0")
```

Cleaning Text

Scraped text often contains extra whitespace:

```python
text = product.css("p::text").get("").strip()

# or with XPath
text = product.xpath("normalize-space(.//p/text())").get()
```
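Note that XPath's normalize-space() also collapses internal runs of whitespace, which .strip() alone does not. A plain-Python equivalent (a minimal sketch, the helper name is illustrative):

```python
import re

def clean(text):
    """Collapse whitespace runs and trim, like XPath's normalize-space()."""
    return re.sub(r"\s+", " ", text or "").strip()

print(clean("  \n  Health   Potion \t"))  # "Health Potion"
```

Passing None (a common result of a failed .get()) safely yields an empty string.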
