HTML Parsing
After retrieving a web page, you need to extract the specific data you want from the HTML. This is called parsing. There are three main approaches, and the right choice depends on the complexity of your extraction.
Interactive lesson
This topic is covered in depth in the Scrapfly Academy: HTML Parsing lesson with live examples.
Three Approaches
1. CSS Selectors
The most common and recommended approach. CSS selectors use the same syntax as CSS stylesheets to target HTML elements.
```python
from parsel import Selector

html = """
<div class="product">
  <h3><a href="/item/1">Widget</a></h3>
  <span class="price">$10</span>
</div>
"""
sel = Selector(html)

# Select by class
price = sel.css(".price::text").get()  # "$10"

# Select by element and attribute
link = sel.css("h3 a::attr(href)").get()  # "/item/1"

# Select multiple elements
all_prices = sel.css(".price::text").getall()  # ["$10"]
```
CSS selectors are easy to learn and cover 90% of scraping needs. When you inspect an element in browser DevTools, you can right-click and "Copy selector" to get the CSS selector for it.
2. XPath
A more powerful query language that supports conditions, text functions, and ancestor/descendant traversal that CSS selectors cannot express.
```python
from parsel import Selector

sel = Selector(html)

# XPath equivalent of the CSS selector above
price = sel.xpath("//span[@class='price']/text()").get()  # "$10"

# XPath can do things CSS cannot:

# Select elements containing specific text
potions = sel.xpath("//a[contains(text(), 'Potion')]/@href").getall()

# Select the parent of an element
parent = sel.xpath("//span[@class='price']/..").get()

# Select by position
first_product = sel.xpath("(//div[@class='product'])[1]").get()
```
Use XPath when CSS selectors are not expressive enough, for example when you need to select based on text content or navigate to parent elements.
3. Object-Based Parsing
Libraries like BeautifulSoup turn HTML into a tree of Python objects you can navigate programmatically. This is useful for algorithmic traversal but is slower than selector-based approaches.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Navigate the tree
product = soup.find("div", class_="product")
title = product.find("h3").get_text()
price = product.find("span", class_="price").string
```
Recommended Priority
- Start with CSS selectors - simple, fast, covers most cases
- Use XPath as fallback - for complex selections CSS cannot handle
- Use object parsers for algorithms - when you need programmatic traversal
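As a stdlib-only sketch of the third option, Python's built-in `html.parser` supports the kind of event-driven, programmatic traversal that object parsers are for. `PriceCollector` here is a hypothetical class written for illustration, not part of any library:

```python
from html.parser import HTMLParser

# Collect the text of every <span class="price"> by walking the
# document events (start tag, data, end tag) one at a time.
class PriceCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

parser = PriceCollector()
parser.feed('<div class="product"><span class="price">$10</span></div>')
print(parser.prices)  # ['$10']
```

For a one-off field extraction this is far more verbose than a CSS selector, which is exactly why object parsers sit last in the priority list.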
Libraries by Language
| Language | CSS Selectors | XPath | Object-Based |
|---|---|---|---|
| Python | parsel, beautifulsoup | parsel, lxml | beautifulsoup |
| JavaScript | cheerio | - | jsdom |
| Go | goquery | htmlquery | - |
| PHP | domcrawler | domcrawler | simple-html-dom |
| Ruby | nokogiri | nokogiri | - |
For speed-sensitive Python applications, selectolax can be up to 30x faster than BeautifulSoup, and lxml is significantly faster than the built-in html.parser.
Common Patterns
Extracting a List of Items
Most scraping involves finding a list of items and extracting fields from each:
```python
from parsel import Selector

sel = Selector(response.text)

for product in sel.css(".product-card"):
    item = {
        "name": product.css("h3::text").get(),
        "price": product.css(".price::text").get(),
        "url": product.css("a::attr(href)").get(),
        "image": product.css("img::attr(src)").get(),
    }
    print(item)
```
Handling Missing Data
Not every element will always be present. Use defaults to avoid errors:
```python
price = product.css(".price::text").get(default="N/A")
rating = product.css(".rating::attr(data-score)").get(default="0")
```
Cleaning Text
Scraped text often contains extra whitespace:
```python
text = product.css("p::text").get("").strip()

# or with XPath
text = product.xpath("normalize-space(.//p/text())").get()
```
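When `.strip()` is not enough (embedded newlines and tabs inside the text), a small helper can normalize whitespace. `clean()` here is a hypothetical utility written for illustration, not a parsel API:

```python
import re

def clean(text):
    # Collapse internal whitespace runs to single spaces and trim ends.
    # Treat None (a missing element) as an empty string.
    return re.sub(r"\s+", " ", text or "").strip()

print(clean("  \n  Free\t shipping  "))  # "Free shipping"
```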
Next Steps
- JSON Parsing - parsing JSON data from APIs
- Hidden Web Data - finding data in script tags
- Scrapfly Academy: HTML Parsing - interactive lesson with CSS/XPath cheatsheets