
Dynamic Page Scraping

Dynamic pages use JavaScript to load content after the initial HTML is delivered. Single-page applications (SPAs) built with React, Vue, or Angular are fully dynamic - the HTML source contains almost no data until JavaScript runs.

Interactive lesson

This topic is covered in the Scrapfly Academy: Dynamic Scraping lesson.

Identifying Dynamic Pages

A page is dynamic if the data you want is not in the initial HTML response. To check:

  1. View the page source (Ctrl+U) - if the content is missing, it is loaded by JavaScript
  2. Disable JavaScript in your browser - if the page goes blank or loses content, it is dynamic
  3. Compare `curl https://example.com` output with what the browser shows - differences indicate JavaScript rendering
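
The checks above all ask the same question: does the raw HTML already contain the text you see in the rendered page? A minimal sketch of that comparison (the sample HTML strings and `is_dynamic` helper are illustrative, not part of any library):

```python
def is_dynamic(raw_html: str, expected_text: str) -> bool:
    """True when text visible in the rendered page is absent from the raw
    HTML response - a strong sign that JavaScript loads it afterwards."""
    return expected_text not in raw_html


# An SPA shell: the data is nowhere in the markup
spa_html = '<html><body><div id="root"></div></body></html>'
# A server-rendered page: the data is already there
static_html = '<html><body><div>Product A - $9.99</div></body></html>'

print(is_dynamic(spa_html, "Product A"))     # True
print(is_dynamic(static_html, "Product A"))  # False
```

In practice, `raw_html` would be the body of a plain HTTP request (what `curl` returns), while `expected_text` is any data point you spotted in the browser.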

Two Approaches

1. Use a Headless Browser

The most straightforward approach: let a real browser execute the JavaScript.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa")

    # Wait for the data to load
    page.wait_for_selector(".data-loaded")

    # Now the page has the same content as in a real browser
    content = page.content()
    # Parse content with parsel, beautifulsoup, etc.

    browser.close()
```
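
Once `page.content()` returns the rendered HTML, any HTML library can parse it. As a dependency-free sketch, here is a stdlib `html.parser` approach that collects the text of every element carrying a given class (the `.data-loaded` class name comes from the example above; the `TextCollector` helper is illustrative):

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content of every element with a given class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def extract_texts(html: str, target_class: str) -> list:
    parser = TextCollector(target_class)
    parser.feed(html)
    return [t.strip() for t in parser.texts]


html = '<div class="data-loaded"><span>Item 1</span></div><p class="data-loaded">Item 2</p>'
print(extract_texts(html, "data-loaded"))  # ['Item 1', 'Item 2']
```

Dedicated parsers like parsel offer CSS and XPath selectors and are the better choice for real projects; the stdlib version just shows the shape of the post-render parsing step.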

See Headless Browsers for a detailed guide on Playwright, Puppeteer, and Selenium.

2. Find the Hidden API

Most dynamic pages fetch data from a backend API. If you can find that API endpoint, you can call it directly with an HTTP client - much faster than using a browser.

Use browser DevTools (Network tab) to inspect the requests the page makes:

  1. Open DevTools (F12) and go to the Network tab
  2. Filter by "XHR" or "Fetch" to see API calls
  3. Find the request that returns the data you need
  4. Replicate that request with an HTTP client

```python
import httpx

# Instead of rendering the page with a browser,
# call the API directly
response = httpx.get(
    "https://example.com/api/products",
    headers={"accept": "application/json"},
)
products = response.json()
```

This technique is covered in depth in the Scrapfly Academy: Hidden API Scraping and Reverse Engineering lessons.

Which Approach to Choose?

| Factor | Headless Browser | Hidden API |
| --- | --- | --- |
| Speed | Slow (seconds per page) | Fast (milliseconds) |
| Resource usage | High (browser + memory) | Low (HTTP request) |
| Reliability | Good (sees what the user sees) | Depends (API may change) |
| Setup effort | Easy (just render the page) | Harder (reverse engineer the API) |
| At scale | Expensive | Cheap |

Recommended approach: try to find the hidden API first. If it is too complex or authenticated, fall back to a headless browser.
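
That fallback order can be captured in a small wrapper. A sketch with injectable fetchers - `fetch_via_api` and `fetch_via_browser` are hypothetical callables standing in for the two approaches above:

```python
def fetch_with_fallback(fetch_via_api, fetch_via_browser):
    """Try the fast hidden-API path first; on any failure (endpoint moved,
    auth required, unexpected payload) fall back to a headless browser."""
    try:
        return fetch_via_api()
    except Exception:
        return fetch_via_browser()


# Usage sketch: real implementations would wrap httpx and Playwright calls
def broken_api():
    raise RuntimeError("API requires auth")


data = fetch_with_fallback(broken_api, lambda: ["product from rendered page"])
print(data)  # ['product from rendered page']
```

In production you would likely catch narrower exceptions (HTTP errors, JSON decode errors) and log which path was taken, so API breakage is noticed rather than silently absorbed by the slower browser path.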

Waiting Strategies

When using a headless browser, you need to wait for dynamic content to load. Common patterns:

```python
# Wait for a specific element
page.wait_for_selector(".product-list")

# Wait for the network to be idle (no more API calls)
page.wait_for_load_state("networkidle")

# Wait for a specific API response
with page.expect_response("/api/products") as response_info:
    page.goto("https://example.com/products")
response = response_info.value
```
