Headless Browsers for Scraping

Headless browsers are web browsers running without a visible UI. They execute JavaScript, render CSS, and behave like a real browser, but are controlled programmatically. For web scraping, this means you can scrape pages that load content dynamically with JavaScript.

Interactive lesson

This topic is covered in the Scrapfly Academy: Headless Browsers lesson.

When Do You Need a Browser?

| Scenario | HTTP Client | Headless Browser |
| --- | --- | --- |
| Static HTML pages | Yes | Overkill |
| JavaScript-rendered content (SPAs, React, Vue) | No | Yes |
| Pages that require clicking, scrolling, or typing | No | Yes |
| Heavy anti-bot protections | Sometimes (with TLS fingerprinting) | Yes (with anti-detect tools) |
| API/JSON endpoints | Yes | No |
| High-speed, high-volume scraping | Yes | Too slow |

Rule of thumb: try an HTTP client first. Only use a browser when the content requires JavaScript to load.
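One quick way to apply this rule is to fetch the raw HTML with an HTTP client and check whether the content you want is already there. A minimal sketch, where the helper name and marker string are illustrative (the live request is shown commented out so you can swap in your own client):

```python
def content_missing(html: str, marker: str) -> bool:
    """Return True when the expected content marker is absent from the raw
    HTML - which usually means the page renders it with JavaScript."""
    return marker not in html

# Usage sketch against the demo site used throughout this page:
# import httpx
# html = httpx.get("https://web-scraping.dev/testimonials").text
# if content_missing(html, 'class="testimonial"'):
#     ...  # the content is JS-rendered: fall back to a headless browser
```

If the marker is present in the raw response, a plain HTTP client is enough and you can skip the browser entirely.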

The Big Three

Three browser automation tools dominate web scraping. See the full Browser Automation comparison for details.

Playwright

The most modern and feature-rich option. Supports Chromium, Firefox, and WebKit (Safari's engine) across Python, JavaScript, Java, and .NET.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://web-scraping.dev/testimonials")

    # Wait for dynamic content to load
    page.wait_for_selector(".testimonial")

    # Extract data
    testimonials = page.query_selector_all(".testimonial")
    for t in testimonials:
        author = t.query_selector(".author").text_content()
        text = t.query_selector(".text").text_content()
        print(f"{author}: {text}")

    browser.close()
```

Puppeteer

JavaScript/Node.js only, but has the largest scraping community and plugin ecosystem.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://web-scraping.dev/testimonials');

  // Wait for dynamic content to load
  await page.waitForSelector('.testimonial');

  const data = await page.$$eval('.testimonial', (elements) =>
    elements.map((el) => ({
      author: el.querySelector('.author').textContent,
      text: el.querySelector('.text').textContent,
    }))
  );
  console.log(data);

  await browser.close();
})();
```

For stealth capabilities, use puppeteer-extra with the stealth plugin.

Selenium

The oldest and most mature option with the biggest community. Supports the widest range of languages.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://web-scraping.dev/testimonials")

# Wait until the dynamic content is present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".testimonial"))
)

testimonials = driver.find_elements(By.CSS_SELECTOR, ".testimonial")
for t in testimonials:
    author = t.find_element(By.CSS_SELECTOR, ".author").text
    text = t.find_element(By.CSS_SELECTOR, ".text").text
    print(f"{author}: {text}")

driver.quit()
```

Chrome DevTools Protocol (CDP)

All three tools control browsers through the Chrome DevTools Protocol (CDP), a WebSocket-based API that sends commands to the browser. This also means there are lighter CDP clients that can be useful for scraping.

These are covered in detail in the Browser Libraries overview.

Pro Tip: Capture Hidden APIs

One of the most powerful browser scraping techniques is to use the browser to discover hidden APIs, then scrape those APIs directly with an HTTP client. This gives you the speed of HTTP scraping with the data access of browser scraping.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Capture background network requests
    api_responses = []

    def on_response(resp):
        if "/api/" in resp.url:
            api_responses.append(resp)

    page.on("response", on_response)

    page.goto("https://example.com/products")
    page.wait_for_timeout(3000)  # wait for background API calls to fire

    # Now you know the hidden API endpoints
    for resp in api_responses:
        print(f"Found API: {resp.url}")
        # Scrape this endpoint directly with httpx next time!

    browser.close()
```

This technique is covered in depth in the Scrapfly Academy: Hidden API Scraping lesson.

Performance Considerations

Headless browsers are 10-100x slower than HTTP clients and use significantly more memory. To minimize the overhead:

  • Block unnecessary resources (images, CSS, fonts) to speed up page loads
  • Reuse browser instances instead of launching a new one per request
  • Use HTTP clients when possible - only fall back to browsers when needed
  • Consider a scraping API like Scrapfly that handles JavaScript rendering server-side
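The first of those points, resource blocking, can be sketched as a small predicate wired into Playwright's request routing. The set of blocked resource types here is a judgment call for this example, not a fixed rule:

```python
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}

def should_block(resource_type: str) -> bool:
    """Decide whether a request is worth loading for scraping purposes."""
    return resource_type in BLOCKED_TYPES

def route_handler(route):
    """Abort heavy resources, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()

# Wire it up on a Playwright page before navigating:
# page.route("**/*", route_handler)
# page.goto("https://web-scraping.dev/testimonials")
```

Blocking images and media alone often cuts page weight dramatically, though be careful with CSS and XHR: some sites need them for the content you are after.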
