HTML
HyperText Markup Language (HTML) is a tree-structured markup format that powers the visual web. When scraping HTML, the real challenge is parsing the wanted data out of that structure.
For example, let's take this simple HTML document:
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
We can already see from the nesting that this is a highly structured data format. It's a tree made of elements that have attributes and relationships to each other.
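To make that tree visible, here is a minimal sketch that walks the same document with Python's built-in html.parser module and collects one indented line per element or text node. The TreePrinter class and its lines attribute are illustrative names, not part of any library, and the markup is the sample above with the closing head tag written out.

```python
from html.parser import HTMLParser

HTML = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""


class TreePrinter(HTMLParser):
    """Illustrative helper: records one indented line per node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level
        self.lines = []  # collected output lines

    def handle_starttag(self, tag, attrs):
        # Render the tag with its attributes, indented by nesting depth
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


printer = TreePrinter()
printer.feed(HTML)
printer.close()
tree_lines = printer.lines
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one parent-child relationship, which is exactly what selectors navigate.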
Knowing this, HTML parsing looks much more approachable. All we need to do to extract values from the web page is define rules that describe their location in the tree.
There are several ways to do that. Let's take a look at the most common ones.
CSS Selectors
CSS selectors are a pattern language originally designed for applying CSS styles to elements:
div.item .price {
color: red;
}
In web scraping, we can use the same syntax to select the values to extract. For example, to get the price from this HTML:
<div class="item">
  <span class="price">49.89</span>
</div>
We could use the CSS selector div.item .price to find the span element and then select its text.
Example: Python + parsel
This example uses Python and parsel to select the price using CSS selectors.
from parsel import Selector
html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
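# Build a selector tree from the HTML text
tree = Selector(text=html)
# ::text is parsel's extension for selecting an element's text node
print(tree.css('div.item .price::text').get())
# prints: 49.89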
Here's a list of CSS selector libraries in various programming languages that are used in web scraping:
Language | CSS Selector Library |
---|---|
Python | parsel, beautifulsoup, lxml, pyquery |
Go | goquery, cascadia |
Rust | scraper, soup |
PHP | dom-crawler, DiDom |
Ruby | nokogiri |
R | rvest |
NodeJS | cheerio |
XPath Selectors
XPath is a lesser-known but more powerful alternative to CSS selectors. Its key advantages are that it can:
- Select elements by their text value
- Select an element's parents
- Be easily extended
The example from before:
<div class="item">
  <span class="price">49.89</span>
</div>
In XPath, the equivalent selector would look similar to: //div[@class="item"]//*[@class="price"]/text()
Example: Python + parsel
This example uses Python and parsel to select the price using XPath selectors.
from parsel import Selector
html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
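tree = Selector(text=html)
# Find the price element inside the item div, then take its text node
print(tree.xpath('//div[@class="item"]//*[@class="price"]/text()').get())
# prints: 49.89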
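The two advantages listed earlier, selecting by text value and stepping up to a parent, can be sketched with Python's standard-library xml.etree.ElementTree. Note that this is a stand-in for illustration: ElementTree supports only a limited subset of XPath and expects well-formed XML, so the markup is wrapped in a hypothetical html root element. Libraries such as parsel or lxml accept full XPath on real-world HTML.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample markup, wrapped in a single root
# element because ElementTree parses XML rather than real-world HTML.
DOC = """
<html>
  <body>
    <div class="item">
      <div title="phone">Smartphone 2</div>
      <div class="price">49.89</div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Select an element by its text value
by_text = root.find(".//div[.='Smartphone 2']")

# Select the parent of the element that carries the title attribute
parent = root.find(".//div[@title='phone']/..")

# From that parent, reach the sibling price element
price = parent.find("div[@class='price']").text

print(by_text.get("title"), parent.get("class"), price)
# prints: phone item 49.89
```

Neither lookup has a direct equivalent in standard CSS selectors, which match on tags and attributes and only move downward or sideways in the tree.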
Here's a list of XPath selector libraries in various programming languages that are used in web scraping:
Language | XPath Library |
---|---|
Python | parsel, lxml |
Go | htmlquery, gokogiri |
PHP | dom-crawler, DiDom |
Rust | sxd-xpath |
Ruby | nokogiri |
R | rvest |
CSS vs XPath?
CSS selectors are shorter and simpler, while XPath selectors are more powerful. Why not mix both? Many web scrapers use CSS selectors wherever possible and fall back to XPath for more complex queries.
Native Parsing
Many programming languages have HTML APIs that allow parsing through native programming expressions like .find(class="product"). While CSS and XPath selectors are the most common and standard way to parse HTML, native parsing can sometimes be more performant and easier to use.
Native parsing can also fill in feature gaps when using CSS selectors, so it pairs nicely with other HTML parsing methods.
Example: Python + Beautifulsoup
This example uses Python and beautifulsoup to select the price using beautifulsoup's .find() method.
from bs4 import BeautifulSoup
html = """
<head>
  <title>shop</title>
</head>
<body>
  <div class="item">
    <div title="phone">Smartphone 2</div>
    <div class="price">49.89</div>
  </div>
</body>
"""
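# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching element
print(soup.find('div', {'class': 'price'}).text)
# prints: 49.89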
Here's a list of popular native HTML parsing libraries:
Language | Library |
---|---|
Python | beautifulsoup |
Rust | select.rs |
Go | goquery |
PHP | DiDom |
Ruby | nokogiri |
R | - |
Regular Expressions
Regular expressions should generally be avoided when parsing HTML, as they break easily when the markup changes. However, when only one or two values are needed, such as a token or a product price, a carefully crafted regex can be significantly faster than any HTML tree parser.
For an example of this, see javascript variable scraping.
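As a minimal sketch of the single-value case, the snippet below pulls the price out of the sample markup with Python's built-in re module. The pattern is an assumption tailored to this exact markup: it expects the price to sit in a div whose only attribute is class="price", and it would need adjusting if the page structure changed.

```python
import re

html = """
<div class="item">
  <div title="phone">Smartphone 2</div>
  <div class="price">49.89</div>
</div>
"""

# Assumes the price always sits in a div whose only attribute is class="price"
PRICE_PATTERN = re.compile(r'<div class="price">\s*([\d.]+)\s*</div>')

match = PRICE_PATTERN.search(html)
# Fall back to None instead of raising when the markup doesn't match
price = match.group(1) if match else None

print(price)
# prints: 49.89
```

No tree is built here, which is where the speed comes from, but the same lack of structure is why this approach is fragile for anything beyond a small, stable value.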