Dynamic Page Scraping
Dynamic pages use JavaScript to load content after the initial HTML is delivered. Single-page applications (SPAs) built with React, Vue, or Angular are fully dynamic - the HTML source contains almost no data until JavaScript runs.
Interactive lesson
This topic is covered in the Scrapfly Academy: Dynamic Scraping lesson.
Identifying Dynamic Pages
A page is dynamic if the data you want is not in the initial HTML response. To check:
- View the page source (Ctrl+U) - if the content is missing, it is loaded by JavaScript
- Disable JavaScript in your browser - if the page goes blank or loses content, it is dynamic
- Compare `curl https://example.com` output with what the browser shows - differences indicate JavaScript rendering
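The curl comparison above can be sketched as a quick check in Python. This is a minimal sketch: the HTML snippet and marker string are placeholder examples, and in practice you would obtain the raw HTML with something like `httpx.get(url).text`:

```python
def appears_dynamic(static_html: str, marker: str) -> bool:
    """Return True if text visible in the browser is missing from the raw HTML.

    `marker` is any string you can see on the rendered page, e.g. a
    product name. If it is absent from the un-rendered HTML, the page
    almost certainly loads that content with JavaScript.
    """
    return marker not in static_html


# An SPA's raw HTML is often just an empty mount point plus a script tag:
spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(appears_dynamic(spa_html, "Product Name"))  # True -> the page is dynamic
```

A `False` result does not guarantee the page is static - some pages embed data in the initial HTML and also re-render it with JavaScript - but a `True` result is a strong signal that you need one of the approaches below.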
Two Approaches
1. Use a Headless Browser
The most straightforward approach: let a real browser execute the JavaScript.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa")

    # Wait for the data to load
    page.wait_for_selector(".data-loaded")

    # Now the page has the same content as in a real browser
    content = page.content()
    # Parse content with parsel, beautifulsoup, etc.

    browser.close()
```
See Headless Browsers for a detailed guide on Playwright, Puppeteer, and Selenium.
2. Find the Hidden API
Most dynamic pages fetch data from a backend API. If you can find that API endpoint, you can call it directly with an HTTP client - much faster than using a browser.
Use browser DevTools (Network tab) to inspect the requests the page makes:
- Open DevTools (F12) and go to the Network tab
- Filter by "XHR" or "Fetch" to see API calls
- Find the request that returns the data you need
- Replicate that request with an HTTP client
```python
import httpx

# Instead of rendering the page with a browser,
# call the API directly
response = httpx.get(
    "https://example.com/api/products",
    headers={"accept": "application/json"},
)
products = response.json()
```
This technique is covered in depth in the Scrapfly Academy: Hidden API Scraping and Reverse Engineering lessons.
Which Approach to Choose?
| Factor | Headless Browser | Hidden API |
|---|---|---|
| Speed | Slow (seconds per page) | Fast (milliseconds) |
| Resource usage | High (browser + memory) | Low (HTTP request) |
| Reliability | Good (sees what the user sees) | Depends (API may change) |
| Setup effort | Easy (just render the page) | Harder (reverse engineer the API) |
| At scale | Expensive | Cheap |
Recommended approach: try to find the hidden API first. If it is too complex or authenticated, fall back to a headless browser.
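That recommendation can be expressed as a small fallback wrapper. This is a sketch, not a library API: the two fetcher callables are placeholders - in practice `api_fetch` would wrap an `httpx` call like the one above, and `browser_fetch` would wrap the Playwright snippet:

```python
from typing import Callable


def fetch_with_fallback(
    api_fetch: Callable[[], list],
    browser_fetch: Callable[[], list],
) -> list:
    """Try the cheap hidden-API path first; render with a browser only on failure."""
    try:
        return api_fetch()
    except Exception:
        # The API attempt failed (endpoint changed, auth required, request
        # blocked), so pay the cost of a full browser render instead.
        return browser_fetch()
```

Keeping the two strategies behind one interface also makes it easy to monitor how often the fallback fires - a rising rate usually means the hidden API has changed and needs re-reverse-engineering.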
Waiting Strategies
When using a headless browser, you need to wait for dynamic content to load. Common patterns:
```python
# Wait for a specific element
page.wait_for_selector(".product-list")

# Wait for network to be idle (no more API calls)
page.wait_for_load_state("networkidle")

# Wait for a specific API response
with page.expect_response("/api/products") as response_info:
    page.goto("https://example.com/products")
response = response_info.value
```
Next Steps
- Headless Browsers - detailed guide to Playwright, Puppeteer, Selenium
- Hidden Web Data - finding data in script tags and page source
- Browser Libraries - anti-detect browsers for protected dynamic sites
- Scrapfly Academy: Dynamic Scraping - interactive lesson