Languages
For web scraping to be possible, we only need two types of tools: an HTTP client and an HTML parser. Most programming languages have libraries for both; however, some have better existing tools than others.
Which language is the best?
Web scraping is a data-oriented subject, so naturally, languages used in data programming are a great fit. Additionally, since the main scaling bottleneck is IO blocking (e.g. waiting for a request to complete), features like asynchronous support or easy threading are very valuable for scaling up web scrapers.
Python is the most popular language used for web scraping as it's a great data language with many built-in and community tools useful in web scraping. JavaScript is becoming quite popular too by virtue of being the language of the web.
That being said, almost any programming language can be used for web scraping with great success as long as HTTP client and HTML parser libraries are available.
HTTP Clients
For HTTP clients, we need 3 important features:
- HTTP/2+ support - most real-world browser traffic goes over HTTP/2 or HTTP/3, so scraping over HTTP/1.1 makes us stand out and easier to block.
- Asynchronous support - the biggest scaling problem in web scraping is IO blocking, so asynchronous programming or accessible threading is important for scaling up web scrapers.
- Stability - the web is huge and complex, and there are many things that can go wrong. Having a client that follows RFC standards and behaves as closely as possible to a real web browser will prevent the scraper from being blocked.
Based on these 3 virtues, here's an ordered list of HTTP clients in popular programming languages:
Language | Client | Highlights |
---|---|---|
Python | httpx | feature-rich, http2, async, http-proxy, socks-proxy |
Python | requests | ease of use, http-proxy, socks-proxy |
Go | req | feature-rich, http2, http3, http-proxy, socks-proxy |
Go | resty | feature-rich, http2, http-proxy |
Ruby | typhoeus | uses-curl, concurrency |
Ruby | faraday | ease of use, can adapt typhoeus |
PHP | guzzle | uses-curl, concurrency |
PHP | symfony-http | uses-curl, concurrency |
R | crul | uses-curl, concurrency |
R | httr | uses-curl, concurrency |
Nim | puppy | uses-curl, winhttp or appkit, http-proxy |
Rust | hurl | uses-curl |
NodeJS | axios | feature-rich, async, http-proxy, socks-proxy |
* uses-curl - libraries that use curl inherit its features, like HTTP/SOCKS proxy support.
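
For example, in Python the httpx client covers all three requirements listed above. Here's a minimal sketch (assuming httpx is installed with its http2 extra, i.e. `pip install httpx[http2]`; the URLs are placeholders) that fetches several pages concurrently over an async client:

```python
import asyncio
import httpx


async def main():
    # http2=True enables HTTP/2 negotiation (requires the httpx[http2] extra);
    # which protocol is actually used depends on what the server supports.
    async with httpx.AsyncClient(http2=True) as client:
        urls = ["https://httpbin.org/get" for _ in range(3)]
        # fire requests concurrently instead of waiting on each one (IO blocking)
        responses = await asyncio.gather(*[client.get(url) for url in urls])
        for response in responses:
            print(response.http_version, response.status_code)


asyncio.run(main())
```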
HTML Parsers
Not all web scrapers work with HTML, but generally we need some HTML parsing, and most programming languages have some sort of XML/HTML parser available. However, there are a few important features to look out for:
- CSS selectors - the most common way to parse HTML and XML documents. It's the same selector language used to apply CSS styles.
- XPath selectors - like CSS selectors but significantly more powerful. You'll want XPath when working with large, heavily nested HTML pages.
- Speed, stability and extras.
Based on these virtues, here's an ordered list of HTML parsing libraries in popular programming languages.
Language | XPath Libraries |
---|---|
Python | parsel, lxml |
Go | htmlquery, gokogiri |
PHP | dom-crawler, DiDom |
Rust | sxd-xpath |
Ruby | nokogiri |
R | rvest |
Language | CSS Selector Libraries |
---|---|
Python | parsel, beautifulsoup, lxml, pyquery |
Go | goquery, cascadia |
Rust | scraper, soup |
PHP | dom-crawler, DiDom |
Ruby | nokogiri |
R | rvest |
NodeJS | cheerio |
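
As an illustration, Python's parsel (which appears in both tables above) exposes CSS and XPath selectors through the same API. A minimal sketch on a made-up HTML snippet:

```python
from parsel import Selector

html = """
<div class="product">
  <a href="/item/1">First item</a>
  <span class="price">12.99</span>
</div>
"""
selector = Selector(text=html)

# CSS selectors: concise and familiar from stylesheets
print(selector.css(".product a::text").get())        # First item
print(selector.css(".product a::attr(href)").get())  # /item/1

# XPath selectors: more expressive, e.g. matching by attribute values
print(selector.xpath("//span[@class='price']/text()").get())  # 12.99
```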
JSON Parsers
Modern web scraping deals with JSON almost as often as HTML these days, and every language has native JSON-like data structures (hash tables, dictionaries, etc.), so parsing JSON is rarely a noteworthy challenge. However, there are a few powerful tools that should not be overlooked:
- JMESPath - a powerful path language (like XPath or CSS selectors) for JSON. Very popular, with implementations in most languages used in web scraping.
- JSONPath - an XPath-like path language for JSON with the key ability to select any descendant values (like XPath's // operator). This is a great tool for parsing big, heavily nested JSON datasets.
- jq - the most popular JSON query language and utility. It's a domain-specific language that can be difficult to learn but is very powerful. Unfortunately, there aren't many client library implementations - it's more of a standalone tool. See also jqt.
There are many more JSON parsing libraries and tools with various extra features like type validation, but these three are the most popular ones used in web scraping.
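
For example, the Python jmespath package implements JMESPath queries directly on already-parsed JSON data. A minimal sketch (assuming `pip install jmespath`) on a made-up API response:

```python
import jmespath

# made-up JSON document resembling a paginated product API response
data = {
    "results": [
        {"name": "laptop", "price": {"amount": 999, "currency": "USD"}},
        {"name": "mouse", "price": {"amount": 25, "currency": "USD"}},
    ],
    "paging": {"next": "/api/products?page=2"},
}

# select every product name and flatten nested price values
print(jmespath.search("results[].name", data))          # ['laptop', 'mouse']
print(jmespath.search("results[].price.amount", data))  # [999, 25]
print(jmespath.search("paging.next", data))             # /api/products?page=2
```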