HTML

HyperText Markup Language (HTML) is a tree-like data structure that powers the visual web. When scraping HTML, the real challenge is parsing the wanted data out of that structure.

For example, let's take this simple HTML document:

<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>

We can already see just from the indentation that this is a highly structured data format. It's a tree made of elements that have attributes and relationships to each other.
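As a rough illustration of that tree, the sketch below walks the example document with Python's standard-library html.parser and prints one line per element or text node, indented by depth. The TreePrinter class and its lines attribute are names made up for this sketch; a real scraper would normally rely on a dedicated parsing library instead.

```python
from html.parser import HTMLParser

HTML = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""


class TreePrinter(HTMLParser):
    """Hypothetical helper: records an indented outline of the document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Show the tag name plus its attributes, indented by nesting depth.
        attr_text = " ".join(f'{key}="{value}"' for key, value in attrs)
        suffix = f" [{attr_text}]" if attr_text else ""
        self.lines.append("  " * self.depth + tag + suffix)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


printer = TreePrinter()
printer.feed(HTML)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one parent-child relationship, which is exactly what selectors navigate.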

Knowing this, HTML parsing looks much more approachable. All we need to do to extract values from the web page is define some rules that describe their location.

There are several ways to do that. Let's take a look at the most common ones.

CSS Selectors

CSS selectors are a path language for selecting the elements that CSS styles apply to:

stylesheet.css

div.item .price {
    color: red;
}

In web scraping, we can use the same technique to select the values to extract. For example, to get the price from this HTML:

<div class="item">
    <span class="price">49.89</span>
</div>

We could use the CSS selector div.item .price to find the span element and then select its text.

Example: Python + parsel

This example uses Python and parsel to select the price using CSS selectors.

from parsel import Selector

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
tree = Selector(text=html)
print(tree.css('div.item .price::text').get())
"49.89"

Here's a list of CSS selector libraries in various programming languages that are used in web scraping:

Language	CSS Selector Libraries
Python	parsel, beautifulsoup, lxml, pyquery
Go	goquery, cascadia
Rust	scraper, soup
PHP	dom-crawler, DiDom
Ruby	nokogiri
R	rvest
NodeJS	cheerio

XPath Selectors

XPath is a lesser-known but more powerful alternative to CSS selectors. Its key advantages are:

Selecting elements by their text value
Selecting an element's parents
Easy extensibility

The example from before:

<div class="item">
    <span class="price">49.89</span>
</div>

In XPath, the equivalent selector would look similar to: //div[@class="item"]//*[@class="price"]/text()

Example: Python + parsel

This example uses Python and parsel to select the price using XPath selectors.

from parsel import Selector

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
tree = Selector(text=html)
print(tree.xpath('//div[@class="item"]//*[@class="price"]/text()').get())
"49.89"
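The first two advantages listed earlier, matching on text and stepping up to a parent, can be sketched with nothing but the standard library. This is a sketch under a loud assumption: xml.etree.ElementTree understands only a small subset of XPath and requires well-formed markup with a single root, so it parses just the body fragment here. Libraries such as parsel or lxml accept full XPath 1.0 expressions on real-world HTML.

```python
import xml.etree.ElementTree as ET

# Only the <body> fragment: ElementTree needs well-formed XML with one root.
fragment = """
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
root = ET.fromstring(fragment.strip())

# 1. Select an element by its text value.
name = root.find(".//div[.='Smartphone 2']")
print(name.get("title"))

# 2. Step up to that element's parent with "..", then down to the price.
price = root.find(".//div[.='Smartphone 2']/../div[@class='price']")
print(price.text)
```

Neither query is expressible with standard CSS selectors, which can neither match on text content nor navigate from a child back to its parent.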

Here's a list of XPath selector libraries in various programming languages that are used in web scraping:

Language	XPath Libraries
Python	parsel, lxml
Go	htmlquery, gokogiri
PHP	dom-crawler, DiDom
Rust	sxd-xpath
Ruby	nokogiri
R	rvest

CSS vs XPath?

CSS selectors are shorter and simpler, while XPath selectors are more powerful. Why not mix both? Many web scrapers use CSS selectors wherever possible and fall back to XPath for more complex queries.

Native Parsing

Many programming languages have HTML parsing libraries that expose the document tree through ordinary method calls like .find(class="product"). While CSS and XPath selectors are the most common and standard way to parse HTML, native parsing can sometimes be faster and easier to write.

Native parsing can also fill in feature gaps when using CSS selectors, so it pairs nicely with other HTML parsing methods.

Example: Python + Beautifulsoup

This example uses Python and beautifulsoup to select the price using beautifulsoup's .find() method.

from bs4 import BeautifulSoup

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', {'class': 'price'}).text)
"49.89"

Here's a list of popular native HTML parsing libraries:

Language	Library
Python	beautifulsoup
Rust	select.rs
Go	goquery
PHP	DiDom
Ruby	nokogiri
R	-

Regular Expressions

Regular expressions should generally be avoided when parsing HTML, as they break easily when the markup changes. However, when only one or two values are needed, regular expressions can significantly outperform HTML tree parsing. So, if the parser only needs to extract a small value like a token or a product price, a carefully crafted regex can be much faster than any HTML parser.

For an example of this, see JavaScript variable scraping.
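As a minimal sketch of the regex approach, the snippet below pulls the price out of the example document with Python's re module. The pattern is an assumption tied to this exact markup (a class attribute of exactly "price" followed directly by a plain decimal number), so it would need adjusting for any other page.

```python
import re

html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""

# Assumed pattern: class="price", the rest of the opening tag, then a number.
PRICE_RE = re.compile(r'class="price"[^>]*>\s*([\d.]+)\s*<')

match = PRICE_RE.search(html)
# Fall back to None instead of raising when the markup does not match.
price = float(match.group(1)) if match else None
print(price)
```

No tree is built here, which is where the speed comes from, but any change to the attribute order or quoting on the page would silently make the pattern miss, hence the None fallback.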