Languages

Web scraping needs only two types of tools: an HTTP client and an HTML parser. Most programming languages have libraries for both; however, some have better existing tools than others.

Which language is the best?

Web scraping is a data task, so languages commonly used in data programming are a natural fit. Additionally, since the main scaling bottleneck is blocking IO (e.g. waiting for a request to complete), features like asynchronous support or easy threading are very valuable for scaling up web scrapers.
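To illustrate why asynchronous support matters for IO-bound work, here is a minimal sketch using only Python's standard library. The `fake_fetch` function is a hypothetical stand-in for a real HTTP request (it just sleeps), and the URLs are placeholders; a real scraper would swap in an async HTTP client.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an HTTP request: the sleep simulates network wait."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def scrape_all(urls: list[str]) -> list[str]:
    # gather() schedules every coroutine at once, so the waits overlap
    # instead of running one after another
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_scrape(urls: list[str]) -> tuple[list[str], float]:
    """Run all fetches concurrently and report the wall-clock time taken."""
    start = time.perf_counter()
    results = asyncio.run(scrape_all(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    results, elapsed = timed_scrape(urls)
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

Run sequentially, five simulated requests of 0.1 seconds each would take about 0.5 seconds in total; run concurrently, they should finish in roughly the time of a single request, because the program is waiting rather than computing.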

Python is the most popular language for web scraping, as it's a great data language with many built-in and community tools suited to the job. JavaScript is becoming quite popular too, by virtue of its use on the web.

That being said, almost any programming language can be used for web scraping with great success as long as HTTP client and HTML parser libraries are available.

HTTP Clients

For HTTP clients, we need 3 important features:

  1. HTTP/2+ support - most real-world traffic goes through HTTP/2 or HTTP/3, so a scraper that uses HTTP/1 stands out and is easy to block.
  2. Asynchronous support - the biggest scaling problem in web scraping is blocking IO, so asynchronous programming or accessible threading is important for scaling up web scrapers.
  3. Stability - the web is huge and complex, and there are many things that can go wrong. A client that follows RFC standards and behaves as closely as possible to a real web browser helps prevent the scraper from being blocked.

Based on these 3 virtues, here's an ordered list of HTTP clients in popular programming languages:

Language  Client        Highlights
Python    httpx         feature-rich, http2, async, http-proxy, socks-proxy
Python    requests      ease of use, http-proxy, socks-proxy
Go        req           feature-rich, http2, http3, http-proxy, socks-proxy
Go        resty         feature-rich, http2, http-proxy
Ruby      typhoeus      uses-curl, concurrency
Ruby      faraday       ease-of-use, can adapt typhoeus
PHP       guzzle        uses-curl, concurrency
PHP       symfony-http  uses-curl, concurrency
R         crul          uses-curl, concurrency
R         httr          uses-curl, concurrency
Nim       puppy         uses-curl winhttp or appkit, http-proxy
Rust      hurl          uses-curl
NodeJS    axios         feature-rich, async, http-proxy, socks-proxy

* uses-curl - all libraries that use curl inherit its features, such as HTTP and SOCKS proxy support.

HTML Parsers

Not all web scrapers work with HTML, but generally we need some HTML parsing, and most programming languages have some sort of XML/HTML parser available. However, there are a few important features to look out for:

  1. CSS selectors - the most common way to parse HTML and XML documents. It's the same language used to select elements when applying CSS styles.
  2. XPath selectors - like CSS selectors but significantly more powerful. You want XPath if you're working with complex HTML pages.
  3. Speed, stability and extras.
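As a rough illustration of what path-based selection looks like, here is a minimal sketch using Python's standard library `xml.etree.ElementTree`. Two assumptions to note: ElementTree supports only a limited subset of XPath, and it requires well-formed markup, so a real scraper would more likely use one of the dedicated libraries listed below. The markup itself is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real pages usually need a lenient HTML parser
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <a href="/widget">details</a>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <a href="/gadget">details</a>
    </div>
    <div class="ad">
      <a href="/promo">buy now</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Roughly the XPath counterpart of the CSS selector "div.product > a"
# (this is an exact attribute match; a CSS class selector would also
# match elements that carry several classes)
links = [a.get("href") for a in root.findall(".//div[@class='product']/a")]

# Select a link by the text of a sibling heading, something
# standard CSS selectors have no syntax for
gadget = root.find(".//div[h2='Gadget']/a")

if __name__ == "__main__":
    print(links)
    print(gadget.get("href"))
```

The second query hints at why XPath is considered more powerful: conditions can refer to the text and structure around an element, not only to its tag and attributes.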

Based on these virtues, here's an ordered list of HTML parsing libraries in popular programming languages.

Language  XPath Library
Python    parsel
Python    lxml
Go        htmlquery
Go        gokogiri
PHP       dom-crawler
PHP       DiDom
Rust      sxd-xpath
Ruby      nokogiri
R         rvest

Language  CSS Selector Library
Python    parsel
Python    beautifulsoup
Python    lxml
Python    pyquery
Go        goquery
Go        cascadia
Rust      scraper
Rust      soup
PHP       dom-crawler
PHP       DiDom
Ruby      nokogiri
R         rvest
NodeJS    cheerio

JSON Parsers

Modern web scrapers parse JSON almost as often as HTML, and every language has a JSON-like native structure (hash tables, dictionaries etc.), so parsing JSON is rarely a noteworthy challenge. However, there are a few powerful tools that should not be overlooked:

  • JMESPath
    Powerful path language (like XPath or CSS selectors) for JSON. Very popular and has implementations in most languages used in web scraping.

  • JSONPath
    XPath-like path language for JSON with the key ability to select any descendant values (like XPath's //). This is a great tool for parsing big, heavily nested JSON datasets.

  • jq
    The most popular JSON query language and utility. Its domain-specific language can be difficult to learn, but it's very powerful. Unfortunately, there aren't many client library implementations; it's more of a standalone tool. See also jqt.

There are many more JSON parsing libraries and tools with various extra features such as type validation, but these 3 are the most popular ones used in web scraping.
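To show what descendant selection means in practice, here is a minimal pure-Python sketch of the idea behind a JSONPath query such as `$..id`. It is not an implementation of JSONPath or of any of the libraries above; the `find_key` helper and the sample response are hypothetical.

```python
import json
from typing import Any, Iterator


def find_key(data: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield v
            # keep descending, since the same key may appear deeper down
            yield from find_key(v, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


# Hypothetical API response with the same key at different depths
RAW = """
{
  "page": {
    "items": [
      {"id": 1, "price": {"amount": 9.5}},
      {"id": 2, "price": {"amount": 12.0}}
    ],
    "meta": {"id": "page-7"}
  }
}
"""

doc = json.loads(RAW)
ids = list(find_key(doc, "id"))
amounts = list(find_key(doc, "amount"))

if __name__ == "__main__":
    print(ids)
    print(amounts)
```

A dedicated path language becomes worthwhile once queries need filters, slices or projections on top of this kind of traversal.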