
Static Page Scraping

Static pages are the simplest type of web page to scrape. The HTML content is delivered directly in the HTTP response without needing JavaScript to render it. You can check whether a page is static by disabling JavaScript in your browser: if the content still appears, the page is static.
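You can also run this check from code: fetch the page with a plain HTTP client (which executes no JavaScript) and look for content you expect. Here is a minimal sketch using httpx; the search string is a placeholder assumption:

```python
import httpx

# Fetch the raw HTML; an HTTP client executes no JavaScript
html = httpx.get("https://web-scraping.dev/products").text

# If the content you want is already in the raw HTML, the page is static
print("product" in html.lower())
```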

Interactive lesson

This topic is covered in depth in the Scrapfly Academy: Static Scraping lesson with live code examples.

What You Need

Static scraping requires just two tools:

  1. HTTP client to fetch the page content
  2. HTML parser to extract the data from the response

For HTTP clients, see the Languages overview for a comparison across programming languages. For HTML parsing, see the HTML Parsing lesson.
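To see how the two tools fit together, here is a minimal end-to-end sketch using httpx and BeautifulSoup; the CSS selector is a placeholder assumption that you would adapt to the target page:

```python
import httpx
from bs4 import BeautifulSoup

# Tool 1: the HTTP client fetches the page content
html = httpx.get("https://web-scraping.dev/products").text

# Tool 2: the HTML parser extracts data from the response
soup = BeautifulSoup(html, "html.parser")
for link in soup.select("a"):  # placeholder selector; adapt to your target
    print(link.get("href"))
```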

Making Your First Request

The simplest scrape fetches a URL and reads the response:

Using httpx (recommended for its HTTP/2 support):

```python
import httpx

# Create a client with browser-like defaults
client = httpx.Client(
    http2=True,
    follow_redirects=True,
    headers={
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
    },
)

response = client.get("https://web-scraping.dev/products")
print(response.status_code)  # 200
print(response.text[:500])   # first 500 chars of HTML
```

Using axios:

```javascript
const axios = require('axios');

async function main() {
    const response = await axios.get('https://web-scraping.dev/products', {
        headers: {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'accept': 'text/html,application/xhtml+xml',
        },
    });
    console.log(response.status);  // 200
    console.log(response.data.slice(0, 500));  // first 500 chars of HTML
}

main();
```

Using req:

```go
package main

import (
	"fmt"

	"github.com/imroc/req/v3"
)

func main() {
	// Create a client with a browser-like user-agent
	client := req.C().SetUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

	resp, err := client.R().Get("https://web-scraping.dev/products")
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode)     // 200
	fmt.Println(resp.String()[:500]) // first 500 chars of HTML
}
```

Common Challenges

Headers

Websites check HTTP headers to identify scrapers. At minimum, you should set:

  • User-Agent - identifies your "browser". Use a real browser user-agent string.
  • Accept - what content types you accept. Use text/html for web pages.
  • Accept-Language - what language you want. Use en-US,en;q=0.9.

Missing or unusual headers are the #1 reason static scrapers get blocked.
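One way to see why default headers get flagged is to echo your request headers back; a quick sketch using the public httpbin.org service:

```python
import httpx

# httpbin echoes back the headers it received; without any overrides,
# the user-agent identifies the HTTP library rather than a browser
resp = httpx.get("https://httpbin.org/headers")
print(resp.json()["headers"])
```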

Cookies and Sessions

HTTP is stateless, so websites use cookies to track sessions. Some pages require specific cookies to load correctly (for example, accepting a cookie consent banner, selecting a region, or being logged in).

Use a session/client object that persists cookies across requests:

```python
import httpx

client = httpx.Client()

# First request sets cookies
client.get("https://example.com")

# Subsequent requests send cookies automatically
response = client.get("https://example.com/products")
```
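If you already know which cookie a page expects (for example, a region selection), you can set it when creating the client instead of triggering it with a first request. The cookie name and value below are hypothetical placeholders:

```python
import httpx

# Hypothetical cookie; use whatever the target site actually expects
client = httpx.Client(cookies={"region": "us"})
response = client.get("https://example.com/products")
```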

HTTP/2

Many modern websites use HTTP/2. If you scrape with HTTP/1.1, the protocol difference can flag your request as non-browser traffic. Use an HTTP client that supports HTTP/2, such as httpx in Python (shown above).
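With httpx you can confirm which protocol version was actually negotiated; a minimal sketch (HTTP/2 support requires installing the h2 extra):

```python
import httpx  # pip install "httpx[http2]"

client = httpx.Client(http2=True)
resp = client.get("https://web-scraping.dev/products")
print(resp.http_version)  # "HTTP/2" if negotiated, otherwise "HTTP/1.1"
```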

Anti-Bot Blocking

If a website uses anti-bot protections like Cloudflare or DataDome, standard HTTP clients will get blocked regardless of headers, and you will need a different approach (see the Anti-Bot Protections lesson).
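A rough way to notice a block in your scraper is to check for the common block status codes; this is a heuristic sketch, and the URL is a placeholder:

```python
import httpx

resp = httpx.get("https://example.com/protected")  # placeholder URL
# 403 Forbidden and 429 Too Many Requests are common anti-bot responses;
# this heuristic does not catch every blocking technique
if resp.status_code in (403, 429):
    print("Likely blocked; see the Anti-Bot Protections lesson")
```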

When to Use Static Scraping

| Use When | Do Not Use When |
| --- | --- |
| Page content is in the HTML source | Content loads via JavaScript (SPAs, React, etc.) |
| No login or complex session required | Website uses heavy anti-bot protections |
| You need high speed and low resources | Page requires clicking or scrolling to load data |

For JavaScript-rendered pages, see Dynamic Scraping. For anti-bot bypass, see Anti-Bot Protections.
