Static Page Scraping
Static pages are the simplest type of web page to scrape. The HTML content is delivered directly in the HTTP response without needing JavaScript to render it. You can verify if a page is static by disabling JavaScript in your browser. If the content still appears, it is a static page.
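You can apply the same check programmatically. The helper below is a hypothetical sketch: it looks for the content you expect in the raw (un-rendered) HTML, and treats an empty SPA mount point such as `<div id="root"></div>` as a hint that the page is JavaScript-rendered. The marker strings are assumptions, not an exhaustive list:

```python
# Heuristic: does the raw HTML already contain the data, or is it an
# empty shell that JavaScript fills in later? (hypothetical helper)
SPA_MARKERS = ('<div id="root"></div>', '<div id="app"></div>')

def looks_static(raw_html: str, expected_text: str) -> bool:
    if any(marker in raw_html for marker in SPA_MARKERS):
        return False  # empty mount point: content is likely JS-rendered
    return expected_text in raw_html
```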
Interactive lesson
This topic is covered in depth in the Scrapfly Academy: Static Scraping lesson with live code examples.
What You Need
Static scraping requires just two tools:
- HTTP client to fetch the page content
- HTML parser to extract the data from the response
For HTTP clients, see the Languages overview for a comparison across programming languages. For HTML parsing, see the HTML Parsing lesson.
Making Your First Request
The simplest scrape fetches a URL and reads the response:
Using httpx (recommended for its HTTP/2 support):
```python
import httpx

# Create a client with browser-like defaults
client = httpx.Client(
    http2=True,
    follow_redirects=True,
    headers={
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
    },
)

response = client.get("https://web-scraping.dev/products")
print(response.status_code)  # 200
print(response.text[:500])   # first 500 chars of HTML
```
Using axios:
```javascript
const axios = require('axios');

// wrap in an async function: top-level await is not available in CommonJS
(async () => {
  const response = await axios.get('https://web-scraping.dev/products', {
    headers: {
      'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'accept': 'text/html,application/xhtml+xml',
    },
  });
  console.log(response.status);             // 200
  console.log(response.data.slice(0, 500)); // first 500 chars of HTML
})();
```
Using req:
```go
package main

import (
	"fmt"

	"github.com/imroc/req/v3"
)

func main() {
	client := req.C().SetUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
	resp, err := client.R().Get("https://web-scraping.dev/products")
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode)
	fmt.Println(resp.String()[:500])
}
```
Common Challenges
Headers
Websites check HTTP headers to identify scrapers. At minimum, you should set:
- User-Agent - identifies your "browser". Use a real browser user-agent string.
- Accept - what content types you accept. Use `text/html` for web pages.
- Accept-Language - what language you want. Use `en-US,en;q=0.9`.
Missing or unusual headers are the #1 reason static scrapers get blocked.
Cookies and Sessions
HTTP is stateless, so websites use cookies to track sessions. Some pages require specific cookies to load correctly (for example, accepting a cookie consent banner, selecting a region, or being logged in).
Use a session/client object that persists cookies across requests:
```python
import httpx

client = httpx.Client()

# First request sets cookies
client.get("https://example.com")

# Subsequent requests send cookies automatically
response = client.get("https://example.com/products")
```
HTTP/2
Many modern websites use HTTP/2. If you scrape with HTTP/1.1, the protocol mismatch can flag your requests as non-browser traffic. Use an HTTP client that supports HTTP/2, such as httpx (Python) or req (Go).
Anti-Bot Blocking
If a website uses anti-bot protections like Cloudflare or DataDome, standard HTTP clients will get blocked regardless of headers. You will need either:
- A TLS fingerprint library like curl-cffi or primp
- A headless browser
- A web scraping API like Scrapfly that handles bypass automatically
When to Use Static Scraping
| Use When | Do Not Use When |
|---|---|
| Page content is in the HTML source | Content loads via JavaScript (SPAs, React, etc.) |
| No login or complex session required | Website uses heavy anti-bot protections |
| You need high speed and low resources | Page requires clicking or scrolling to load data |
For JavaScript-rendered pages, see Dynamic Scraping. For anti-bot bypass, see Anti-Bot Protections.
Next Steps
- HTML Parsing - extracting data from the HTML you retrieved
- Hidden Web Data - finding data in script tags and microdata
- Scrapfly Academy: Static Scraping - interactive lesson with more examples