Dynamic Page Scraping
Dynamic pages use JavaScript to load content after the initial HTML is delivered. Single-page applications (SPAs) built with React, Vue, or Angular are fully dynamic - the HTML source contains almost no data until JavaScript runs.
Interactive lesson
This topic is covered in the Scrapfly Academy: Dynamic Scraping lesson.
Identifying Dynamic Pages
A page is dynamic if the data you want is not in the initial HTML response. To check:
- View the page source (Ctrl+U) - if the content is missing, it is loaded by JavaScript
- Disable JavaScript in your browser - if the page goes blank or loses content, it is dynamic
- Compare `curl https://example.com` output with what the browser shows - differences indicate JavaScript rendering
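The curl comparison above can be sketched as a quick check in Python. This is a minimal sketch: the HTML snippet and marker string are placeholder examples, and in practice you would obtain the raw HTML with something like `httpx.get(url).text`:

```python
def appears_dynamic(static_html: str, marker: str) -> bool:
    """Return True if text visible in the browser is missing from the raw HTML.

    `marker` is any string you can see on the rendered page, e.g. a
    product name. If it is absent from the un-rendered HTML, the page
    almost certainly loads that content with JavaScript.
    """
    return marker not in static_html


# An SPA's raw HTML is often just an empty mount point plus a script tag:
spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(appears_dynamic(spa_html, "Product Name"))  # True -> the page is dynamic
```

A `False` result does not guarantee the page is static - some pages embed data in the initial HTML and also re-render it with JavaScript - but a `True` result is a strong signal that you need one of the approaches below.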
Two Approaches
1. Use a Headless Browser
The most straightforward approach: let a real browser execute the JavaScript.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa")

    # Wait for the data to load
    page.wait_for_selector(".data-loaded")

    # Now the page has the same content as in a real browser
    content = page.content()
    # Parse content with parsel, beautifulsoup, etc.

    browser.close()
```
See Headless Browsers for a detailed guide on Playwright, Puppeteer, and Selenium.
2. Find the Hidden API
Most dynamic pages fetch data from a backend API. If you can find that API endpoint, you can call it directly with an HTTP client - much faster than using a browser.
Use browser DevTools (Network tab) to inspect the requests the page makes:
- Open DevTools (F12) and go to the Network tab
- Filter by "XHR" or "Fetch" to see API calls
- Find the request that returns the data you need
- Replicate that request with an HTTP client
```python
import httpx

# Instead of rendering the page with a browser,
# call the API directly
response = httpx.get(
    "https://example.com/api/products",
    headers={"accept": "application/json"},
)
products = response.json()
```
This technique is covered in depth in the Scrapfly Academy: Hidden API Scraping and Reverse Engineering lessons.
Which Approach to Choose?
| Factor | Headless Browser | Hidden API |
|---|---|---|
| Speed | Slow (seconds per page) | Fast (milliseconds) |
| Resource usage | High (browser + memory) | Low (HTTP request) |
| Reliability | Good (sees what the user sees) | Depends (API may change) |
| Setup effort | Easy (just render the page) | Harder (reverse engineer the API) |
| At scale | Expensive | Cheap |
Recommended approach: try to find the hidden API first. If it is too complex or authenticated, fall back to a headless browser.
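That recommendation can be expressed as a small fallback wrapper. This is a sketch, not a library API: the two fetcher callables are placeholders - in practice `api_fetch` would wrap an `httpx` call like the one above, and `browser_fetch` would wrap the Playwright snippet:

```python
from typing import Callable


def fetch_with_fallback(
    api_fetch: Callable[[], list],
    browser_fetch: Callable[[], list],
) -> list:
    """Try the cheap hidden-API path first; render with a browser only on failure."""
    try:
        return api_fetch()
    except Exception:
        # The API attempt failed (endpoint changed, auth required, request
        # blocked), so pay the cost of a full browser render instead.
        return browser_fetch()
```

Keeping the two strategies behind one interface also makes it easy to monitor how often the fallback fires - a rising rate usually means the hidden API has changed and needs re-reverse-engineering.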
Waiting Strategies
When using a headless browser, you need to wait for dynamic content to load. Common patterns:
```python
# Wait for a specific element
page.wait_for_selector(".product-list")

# Wait for network to be idle (no more API calls)
page.wait_for_load_state("networkidle")

# Wait for a specific API response
with page.expect_response("/api/products") as response_info:
    page.goto("https://example.com/products")
response = response_info.value
```
Next Steps
- Headless Browsers - detailed guide to Playwright, Puppeteer, Selenium
- Hidden Web Data - finding data in script tags and page source
- Browser Libraries - anti-detect browsers for protected dynamic sites
- Scrapfly Academy: Dynamic Scraping - interactive lesson