HTML Parsing
After retrieving a web page, you need to extract the specific data you want from the HTML. This is called parsing. There are three main approaches, and the right choice depends on the complexity of your extraction.
Interactive lesson
This topic is covered in depth in the Scrapfly Academy: HTML Parsing lesson with live examples.
Three Approaches
1. CSS Selectors
The most common and recommended approach. CSS selectors use the same syntax as CSS stylesheets to target HTML elements.
```python
from parsel import Selector

html = """
<div class="product">
  <h3><a href="/item/1">Widget</a></h3>
  <span class="price">$10</span>
</div>
"""
sel = Selector(html)

# Select by class
price = sel.css(".price::text").get()  # "$10"

# Select by element and attribute
link = sel.css("h3 a::attr(href)").get()  # "/item/1"

# Select multiple elements
all_prices = sel.css(".price::text").getall()  # ["$10"]
```
CSS selectors are easy to learn and cover 90% of scraping needs. When you inspect an element in browser DevTools, you can right-click and "Copy selector" to get the CSS selector for it.
2. XPath
A more powerful query language that supports conditions, text functions, and ancestor/descendant traversal that CSS selectors cannot express.
```python
from parsel import Selector

sel = Selector(html)

# XPath equivalent of the CSS selector above
price = sel.xpath("//span[@class='price']/text()").get()  # "$10"

# XPath can do things CSS cannot:

# Select elements containing specific text
potions = sel.xpath("//a[contains(text(), 'Potion')]/@href").getall()

# Select the parent of an element
parent = sel.xpath("//span[@class='price']/..").get()

# Select by position
first_product = sel.xpath("(//div[@class='product'])[1]").get()
```
Use XPath when CSS selectors are not expressive enough, for example when you need to select based on text content or navigate to parent elements.
3. Object-Based Parsing
Libraries like BeautifulSoup turn HTML into a tree of Python objects you can navigate programmatically. This is useful for algorithmic traversal but is slower than selector-based approaches.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Navigate the tree
product = soup.find("div", class_="product")
title = product.find("h3").get_text()
price = product.find("span", class_="price").string
```
Recommended Priority
- Start with CSS selectors - simple, fast, covers most cases
- Use XPath as fallback - for complex selections CSS cannot handle
- Use object parsers for algorithms - when you need programmatic traversal
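As a stdlib-only sketch of the third option, Python's built-in `html.parser` supports the kind of event-driven, programmatic traversal that object parsers are for. `PriceCollector` here is a hypothetical class written for illustration, not part of any library:

```python
from html.parser import HTMLParser

# Collect the text of every <span class="price"> by walking the
# document events (start tag, data, end tag) one at a time.
class PriceCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

parser = PriceCollector()
parser.feed('<div class="product"><span class="price">$10</span></div>')
print(parser.prices)  # ['$10']
```

For a one-off field extraction this is far more verbose than a CSS selector, which is exactly why object parsers sit last in the priority list.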
Libraries by Language
| Language | CSS Selectors | XPath | Object-Based |
|---|---|---|---|
| Python | parsel, beautifulsoup | parsel, lxml | beautifulsoup |
| JavaScript | cheerio | - | jsdom |
| Go | goquery | htmlquery | - |
| PHP | domcrawler | domcrawler | simple-html-dom |
| Ruby | nokogiri | nokogiri | - |
For speed-sensitive Python applications, selectolax can be up to 30x faster than BeautifulSoup, and lxml is significantly faster than the built-in html.parser.
Common Patterns
Extracting a List of Items
Most scraping involves finding a list of items and extracting fields from each:
```python
from parsel import Selector

sel = Selector(response.text)

for product in sel.css(".product-card"):
    item = {
        "name": product.css("h3::text").get(),
        "price": product.css(".price::text").get(),
        "url": product.css("a::attr(href)").get(),
        "image": product.css("img::attr(src)").get(),
    }
    print(item)
```
Handling Missing Data
Not every element will always be present. Use defaults to avoid errors:
```python
price = product.css(".price::text").get(default="N/A")
rating = product.css(".rating::attr(data-score)").get(default="0")
```
Cleaning Text
Scraped text often contains extra whitespace:
```python
text = product.css("p::text").get("").strip()

# or with XPath
text = product.xpath("normalize-space(.//p/text())").get()
```
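When `.strip()` is not enough (embedded newlines and tabs inside the text), a small helper can normalize whitespace. `clean()` here is a hypothetical utility written for illustration, not a parsel API:

```python
import re

def clean(text):
    # Collapse internal whitespace runs to single spaces and trim ends.
    # Treat None (a missing element) as an empty string.
    return re.sub(r"\s+", " ", text or "").strip()

print(clean("  \n  Free\t shipping  "))  # "Free shipping"
```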
Next Steps
- JSON Parsing - parsing JSON data from APIs
- Hidden Web Data - finding data in script tags
- Scrapfly Academy: HTML Parsing - interactive lesson with CSS/XPath cheatsheets