HTTP

HyperText Transfer Protocol (HTTP) is the foundation of the web, and when it comes to web scraping, a basic understanding of it is needed to write successful web scrapers.

In web scraping, we use this protocol to retrieve web pages. For example, to retrieve a page in Python we'd use a script like this one:

import httpx
response = httpx.get("http://httpbin.org/html")
print(response.text)

The script above sends a request to the URL http://httpbin.org/html. In return, we get a response from the server with the web page data.

The goal of a web scraper is to send a valid HTTP request and receive the response data. For a request to be valid, it has to match the server's expectations. In other words, requests should look like they are coming from a real user using a web browser.

In this section, we'll take a look at the most important HTTP theory when it comes to web scraping.

URL

A Uniform Resource Locator (URL) indicates the address of a web resource. It's made up of a few key parts that play different roles when it comes to web scraping:

URL:

Protocol  Host         Path                Query                  Anchor
https     example.com  /path/to/resource   arg=value&arg2=value2  #anchor

Some parts of the URL can play an important role in web scraping:

  • Protocol - indicates whether the endpoint uses end-to-end encryption (https vs http). When scraping we prefer unprotected endpoints because of TLS fingerprinting.
  • Host - is the domain name. When crawling, it's good practice to lock crawling behavior to a specific host to avoid unnecessary requests.
  • Query - contains the request parameters. There's no strict standard for their format, so they should match whatever format and order the website itself uses to avoid being blocked (see the sketch below).
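
As a minimal sketch (the httpbin.org URL and parameters are only examples), most HTTP clients can build the query string for us; with httpx the params argument keeps the parameters in the order they are given:

import httpx

# parameters passed via `params` are encoded into the query string
# in the order given (Python dicts preserve insertion order)
response = httpx.get(
    "http://httpbin.org/get",
    params={"search": "foo", "page": "2"},
)
print(response.request.url)
# http://httpbin.org/get?search=foo&page=2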

HTTP versions

HTTP has several versions:

  • HTTP 1.1 - the simplest version of the protocol. Modern browsers use it only as a fallback.
  • HTTP 2 - used by all browsers; implemented by some HTTP clients.
  • HTTP 3 (aka QUIC) - used by browsers when possible; HTTP client implementations are rare.

All HTTP clients support HTTP 1.1, though most real user web traffic is HTTP 2 or 3. So, to avoid being blocked when web scraping, HTTP 2 or later should be used.
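
As a brief sketch, httpx (listed in the table below) can speak HTTP 2 when installed with its http2 extra and the flag is enabled on the client:

import httpx

# requires: pip install "httpx[http2]"
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.org/get")
    # "HTTP/2" if the server negotiated it, otherwise "HTTP/1.1"
    print(response.http_version)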

Language  Client        Highlights
Python    httpx         feature-rich, http2, async, http-proxy, socks-proxy
Python    requests      ease of use, http-proxy, socks-proxy
Go        req           feature-rich, http2, http3, http-proxy, socks-proxy
Go        resty         feature-rich, http2, http-proxy
Ruby      typhoeus      uses-curl, concurrency
Ruby      faraday       ease of use, can adapt typhoeus
PHP       guzzle        uses-curl, concurrency
PHP       symfony-http  uses-curl, concurrency
R         crul          uses-curl, concurrency
R         httr          uses-curl, concurrency
Nim       puppy         uses-curl (winhttp or appkit), http-proxy
Rust      hurl          uses-curl
NodeJS    axios         feature-rich, async, http-proxy, socks-proxy

* uses-curl - all libraries that use curl inherit its features, like http/socks proxies etc.

Request Types

There are several types of requests but in web scraping we'll mostly be working with GET, POST and HEAD requests:

GET

The most common request type - request a resource.
This is the bread and butter of web scraping as usually all we need is to download the page contents.

import httpx

response = httpx.get("http://httpbin.org/get")
print(response.status_code)  # e.g. 200
print(response.text)  # the response body

POST

Send some data.
Used when scraping forms, search or hidden APIs - any functionality where the client needs to provide more details than is possible through URL parameters.

import httpx

# when posting data we need to indicate the data type using the Content-Type header
httpx.post(
    url="http://httpbin.org/post",
    content='''{"search": "foo"}''',
    headers={"Content-Type": "application/json"},
)
Note that for POST requests, two content types are most popular: JSON and Form Data.
# note that most HTTP clients have shortcuts for 2 most common content types:
# JSON:
httpx.post(
    url="http://httpbin.org/post",
    json={"search": "foo"},
)
# Form Data:
httpx.post(
    url="http://httpbin.org/post",
    data={"search": "foo"},
)

HEAD

Request metadata only.
Used to inspect a scrape target cheaply, as it only retrieves the page's meta information, such as when it was last modified.

import httpx

response = httpx.head("https://pypi.org/project/httpx/")

# Head responses have headers but no body:
response.text
# ""
response.headers
# Headers({'connection': 'keep-alive', 'conten...

# this is useful for checking whether the page has changed since the last scrape, e.g. by checking the etag header:
response.headers['etag']
# '"snUsTU1tT7lzmRCIehAHpQ"'

Request Headers

With every request, we can also provide optional metadata called headers. Most HTTP clients configure default headers for us but there are a few important ones:

  • Referer and Origin headers indicate where the request is coming from.
    Usually Origin is the website's domain name and Referer is the URL of the previous page.
  • Accept and Accept- prefixed headers indicate what sort of response the client accepts.
    Generally, these should match the common values used in web browsers.
  • User-Agent and Sec- prefixed headers indicate who's making the request. These should always be set to the values of a common web browser.
  • X- prefixed headers are non-standard and important to replicate in web scrapers as they can contain authorization tokens (like X-CSRF) or other required details.

Note that, generally, we want to replicate the headers and their order the way a web browser sends them to avoid being blocked. See the devtools section for more.
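
As an illustrative sketch (the header values below are examples, not a complete browser fingerprint), browser-like headers can be set once on the client so every request carries them:

import httpx

# example browser-like headers; a real scraper should copy the exact
# values and ordering from a browser's devtools network tab
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

with httpx.Client(headers=headers) as client:
    response = client.get("http://httpbin.org/headers")
    print(response.text)  # httpbin echoes back the headers it received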

Response Status Codes

Once we send out our request, we will either receive a success/failure response or a timeout error will be raised (the server can also simply ignore us).

Each response has a status code that signifies success, failure or a request for an alternative action (a handling sketch follows the list):

  • 200 range responses generally mean success!
    Note that 200 can be a lie when scraping websites protected by anti-scraping services. Often the code is 200 but the HTML body indicates that the client is being blocked and holds no real content.
  • 300 range responses mean redirection.
    This means the page location has changed. Most HTTP clients can follow redirects automatically.
  • 400 range responses mean a blocked or invalid request.
    This could mean the web server is blocking the scraper, or there's simply an error in the scraper's request logic, like a missing header, missing cookies or a bad URL.
  • 500 range responses mean blocking or a server outage.
    Generally, 500 means the server can't complete the request, either because of internal technical difficulties, because it can't understand the request, or because the client is being blocked.
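
As a rough sketch (the "captcha" check is just a placeholder assumption; real block detection depends on the target website), a scraper can branch on these status ranges before parsing:

import httpx

response = httpx.get("http://httpbin.org/status/200")

if response.status_code >= 500:
    print("server error or block - worth retrying later")
elif response.status_code >= 400:
    print("blocked or invalid request - check headers, cookies and the URL")
elif response.status_code >= 300:
    # note: httpx only follows redirects automatically when follow_redirects=True is set
    print("redirect - follow the Location header")
elif "captcha" in response.text.lower():
    # placeholder check: some anti-bot services return 200 with a block page
    print("soft block - 200 status but no real content")
else:
    print("success - safe to parse the HTML")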

Response Headers

Just like request headers, response headers contain metadata about the response. Here are some of the most commonly encountered response headers in web scraping, with a short example after the list:

  • Content-Type specifies what sort of response we've received. Usually this is text/html for HTML documents or application/json for JSON.
  • Set-Cookie header requests the client to save some cookies. Many HTTP clients, like Python's httpx.Client(), will save/send/manage cookies automatically.
  • Location header indicates the redirect location for 30x response status codes.
  • X- prefixed headers are non-standard and can contain important details depending on the website.
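
As a small sketch tying these together (the httpbin.org endpoints are just examples), a persistent httpx.Client stores Set-Cookie values between requests, and other headers can be read straight from the response:

import httpx

with httpx.Client(follow_redirects=True) as client:
    # this endpoint replies with a Set-Cookie header and a redirect;
    # the client stores the cookie automatically
    client.get("http://httpbin.org/cookies/set/session/abc123")

    response = client.get("http://httpbin.org/cookies")
    print(response.headers.get("Content-Type"))  # application/json
    print(client.cookies.get("session"))  # abc123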