HTML

HyperText Markup Language (HTML) is a tree-like data structure that powers the visual web. When scraping HTML, the real challenge is parsing the wanted data out of that structure.

For example, let's take this simple HTML document:

<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>

We can already see from the indentation alone that this is a highly structured data format. It's a tree made of elements that have attributes and relationships to each other:

[Figure: HTML tree]
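
To make the tree structure concrete, here is a minimal sketch that prints the document above as an indented outline. It uses Python's standard-library html.parser rather than a scraping library, and the TreePrinter class is a hypothetical helper written only for this illustration:

```python
from html.parser import HTMLParser

HTML = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""


class TreePrinter(HTMLParser):
    """Hypothetical helper: collects one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Render the attributes next to the tag name, then go one level deeper.
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up to the parent level.
        self.depth -= 1

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.lines.append("  " * self.depth + repr(data.strip()))


printer = TreePrinter()
printer.feed(HTML)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the output corresponds to one parent-child relationship, which is exactly what selectors navigate.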

Knowing this, HTML parsing looks much more approachable. All we need to do to extract values from a web page is define rules that describe where those values are located in the tree.

There are several ways to do that. Let's take a look at the most common ones.

CSS Selectors

CSS selectors are a path language originally designed for applying CSS styles:

stylesheet.css
div.item .price {
    color: red;
}

In web scraping, we can use the same technique to select the values to extract. For example, to get the price from this HTML:

<div class="item">
    <span class="price">49.89</span>
</div>

We could use a CSS selector div.item .price to find the span element and then select its text.

Example: Python + parsel

This example uses Python and parsel to select the price using CSS selectors.

from parsel import Selector

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
# Build a selector tree from the HTML text
tree = Selector(text=html)
# ::text selects the text node of the matched element
print(tree.css('div.item .price::text').get())
# 49.89

Here's a list of CSS selector libraries in various programming languages that are used in web scraping:

Language    CSS Selector Library
Python      parsel, beautifulsoup, lxml, pyquery
Go          goquery, cascadia
Rust        scraper, soup
PHP         dom-crawler, DiDom
Ruby        nokogiri
R           rvest
NodeJS      cheerio

XPath Selectors

XPath is a lesser-known but more powerful alternative to CSS selectors. Its key advantages are that it:

  • Can select elements by their text value
  • Can select an element's parents
  • Is easily extendable

The example from before:

<div class="item">
    <span class="price">49.89</span>
</div>

In XPath, the equivalent selector would look similar to: //div[@class="item"]//*[@class="price"]/text()

Example: Python + parsel

This example uses Python and parsel to select the price using XPath selectors.

from parsel import Selector

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
# Build a selector tree from the HTML text
tree = Selector(text=html)
# text() selects the text node of the matched element
print(tree.xpath('//div[@class="item"]//*[@class="price"]/text()').get())
# 49.89
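
The two XPath advantages listed earlier, selecting by text value and selecting parents, can be sketched as follows. To stay dependency-free, this sketch uses the standard library's xml.etree.ElementTree as a stand-in. It supports only a limited XPath subset (the text predicate needs Python 3.7+) and works here only because the snippet is well-formed XML. Both expressions are plain XPath 1.0, so they should carry over to full XPath libraries such as parsel or lxml:

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="item">
  <div title="phone">Smartphone 2</div>
  <div class="price">49.89</div>
</div>
""".strip()

# ElementTree is a stand-in here: it needs well-formed XML, unlike real HTML parsers.
root = ET.fromstring(snippet)

# Select an element by its text value, then read one of its attributes.
name = root.find(".//div[.='Smartphone 2']")
print(name.get("title"))  # phone

# Select the parent of the element whose text is the price.
parent = root.find(".//div[.='49.89']/..")
print(parent.get("class"))  # item
```

Neither query can be expressed with standard CSS selectors, which match on tags and attributes and only navigate downwards.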

Here's a list of XPath selector libraries in various programming languages that are used in web scraping:

Language    XPath Library
Python      parsel, lxml
Go          htmlquery, gokogiri
PHP         dom-crawler, DiDom
Rust        sxd-xpath
Ruby        nokogiri
R           rvest

CSS vs XPath?

CSS selectors are briefer and simpler, while XPath selectors are more powerful. Why not mix both? Many web scrapers use CSS selectors where possible and fall back to XPath for more complex queries.

Native Parsing

Many programming languages have HTML APIs that allow HTML parsing through programming expressions like .find(class="product"). While CSS and XPath selectors are the most common and standard way to parse HTML, native parsing can sometimes be more performant and easier to use.

Native parsing can also fill in feature gaps when using CSS selectors, so it pairs nicely with other HTML parsing methods.

Example: Python + Beautifulsoup

This example uses Python and beautifulsoup to select the price using beautifulsoup's .find() method.

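from bs4 import BeautifulSoup

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(html, "html.parser")
# Find the first div whose class is "price" and read its text
print(soup.find('div', {'class': 'price'}).text)
# 49.89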

Here's a list of popular native HTML parsing libraries:

Language    Library
Python      beautifulsoup
Rust        select.rs
Go          goquery
PHP         DiDom
Ruby        nokogiri
R           -

Regular Expressions

Regular expressions should generally be avoided when parsing HTML, as they break easily when the markup changes. However, when extracting only one or two values, regular expressions can significantly outperform HTML tree parsing. So, if the parser only needs to extract a small value like a token or a product price, a carefully crafted regex can be significantly faster than any HTML parser.
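
A minimal sketch of that trade-off, using Python's built-in re module on the product snippet from earlier. The pattern is a hypothetical one written for this exact markup, not a general-purpose price matcher:

```python
import re

html = """
<div class="item">
  <div title="phone">Smartphone 2</div>
  <div class="price">49.89</div>
</div>
"""

# Hypothetical pattern for this snippet: it relies on the exact
# class="price" markup and will break if the page layout changes.
PRICE_PATTERN = re.compile(r'class="price">\s*(\d+(?:\.\d+)?)\s*<')

match = PRICE_PATTERN.search(html)
# Fall back to None instead of raising when the pattern is not found.
price = float(match.group(1)) if match else None
print(price)  # 49.89
```

No tree is built here, which is where the speed comes from, but the pattern is tied to the literal characters of the page rather than to its structure.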

For an example of this, see javascript variable scraping.