
Browser Automation

An increasingly popular way to web scrape is to use browser automation tools instead of HTTP clients. Using an entire browser to retrieve the contents of a web page might seem like overkill, but it has several benefits:

Pros 👍

  • Browser scrapers are harder to identify and block since they appear like real users.
  • Browsers execute JavaScript, which makes scraping dynamic pages and web apps possible without reverse engineering the website.
  • Often easier to develop, as we can use human-like instructions: click this button, enter text there.

Cons 👎

  • Much more resource intensive and slower
  • Harder to scale
  • More error prone (browsers are very complicated)

There are three major browser automation toolkits: Playwright, Puppeteer, and Selenium.

Playwright

Playwright is the newest library with a rapidly growing community. It's available in many different languages and features both asynchronous and synchronous clients. Playwright has the most modern and easy-to-follow programming API of all the browser automation libraries.

Example: Python + Playwright

This example uses Python and Playwright to scrape this page and parsel to parse it:

```python
# Playwright has 2 APIs: a synchronous one:
from playwright.sync_api import sync_playwright
# and an asynchronous one:
# from playwright.async_api import async_playwright
from parsel import Selector

# the synchronous API is used in this snippet:
with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # navigate to the page
    page.goto("https://webscraping.fyi/overview/browser-automation/")
    # wait for page to load by checking for presence of a loaded element:
    page.wait_for_selector("h2#playwright")
    # then we can retrieve the page source and parse it
    html = page.content()
    selector = Selector(text=html)
    this_snippet = ''.join(
        selector.xpath("//h2[@id='playwright']/following-sibling::details[1]//text()").getall()
    )
    print(this_snippet)
```

Puppeteer

Puppeteer is only available in JavaScript (NodeJS), so while it's less accessible, it has a bigger and older community than Playwright. That maturity also means a larger web scraping community and more resources built around it.

Example: NodeJS + Puppeteer

This example uses NodeJS and Puppeteer to scrape this page and cheerio to parse it:

```javascript
// import puppeteer and cheerio
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrape(){
  // launch puppeteer browser (headless: false shows the browser window)
  const browser = await puppeteer.launch({headless: false});
  let page = await browser.newPage();
  await page.goto('https://webscraping.fyi/overview/browser-automation/', {
    waitUntil: 'domcontentloaded',
  });
  // wait for the page to load
  await page.waitForSelector('h2#selenium', {timeout: 5_000});

  // parse data with cheerio
  const html = await page.content();
  const tree = cheerio.load(html);
  const thisSnippet = tree('h2#puppeteer ~ details').text();
  console.log(thisSnippet);

  // close everything
  await page.close();
  await browser.close();
}

scrape();
```

Selenium

Selenium is the oldest and most mature browser automation toolkit. Maturity is key here, as there are a lot of free educational resources online and many open source scrapers based on Selenium.

Example: Python + Selenium

This example uses Python and Selenium to scrape this page and parsel to parse it:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

# here we can configure the selenium webdriver
options = Options()
options.headless = True  # run browser in the background
options.add_argument("start-maximized")  # ensure window is maximized
options.add_argument("--window-size=1920,1080")  # common window resolution to prevent blocking

driver = webdriver.Chrome(options=options)
driver.get("https://webscraping.fyi/overview/browser-automation/")

# wait for page to load by checking for presence of a loaded element:
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h2#selenium'))
)

# then we can retrieve the page source and parse it
html = driver.page_source
selector = Selector(text=html)
this_snippet = ''.join(
    selector.xpath("//h2[@id='selenium']/following-sibling::details[1]//text()").getall()
)
print(this_snippet)
```

Comparison

Generally, Playwright is the most feature-rich and modern toolkit; however, Puppeteer has a more mature community around it, so finding web scraping extensions and resources for it is much easier.

| Feature | 🥇 Playwright | 🥈 Puppeteer | 🥉 Selenium |
| --- | --- | --- | --- |
| Languages | Python, NodeJS, Java, .NET | NodeJS | Java, Python, C#, Ruby, NodeJS, Kotlin |
| Browsers | Chrome, Firefox, Safari | Chrome, Firefox | Chrome, Firefox, Safari |
| Async | ✅ | ✅ | ❌ |

Chrome Devtools Protocol

CDP is the standard protocol these libraries use to control the web browser.


In short, browsers like Chrome or Firefox can be launched with an open WebSocket connection which can be used to control the browser. This means these three tools are not the only libraries for browser automation, and there are many more CDP clients of varying completeness.
Note that for web scraping we often only need basic browser functionality, so alternative CDP clients can be viable even if they don't implement all of the CDP functionality.
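Under the hood, CDP is just JSON messages sent over that WebSocket. As a minimal sketch of the command format (the method names below are real CDP methods, but connection handling is omitted, so a real client would still need a WebSocket library):

```python
import json

def cdp_command(command_id, method, params=None):
    """Serialize a CDP command for sending over the browser's WebSocket."""
    message = {"id": command_id, "method": method}
    if params:
        message["params"] = params
    return json.dumps(message)

# navigate the current tab to a page
navigate = cdp_command(1, "Page.navigate", {"url": "https://webscraping.fyi/"})
# request the root node of the rendered document
get_doc = cdp_command(2, "DOM.getDocument")

print(navigate)
print(get_doc)
```

The browser replies with JSON messages carrying the same `id`, which is how a client matches responses to the commands it sent.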

Anti-Detect Browsers

A growing category of browser automation tools focuses specifically on evading bot detection systems. Unlike standard browser automation which aims to control the browser, anti-detect tools aim to make the automation invisible to anti-bot systems like Cloudflare, DataDome, and PerimeterX.

These tools address the main detection vectors:

  • WebDriver flag - Standard tools like Selenium set navigator.webdriver = true, which is trivially detectable. Anti-detect browsers remove or mask this.
  • CDP detection - Some anti-bot systems can detect Chrome DevTools Protocol connections. Anti-detect tools use patched browsers that hide these signals.
  • Browser fingerprinting - Anti-detect browsers generate realistic fingerprints (WebGL, Canvas, fonts, screen resolution) that match real user configurations.
  • TLS/HTTP fingerprinting - The TLS handshake and HTTP header ordering can identify automated tools. Some anti-detect solutions patch these at the network level.
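The fingerprinting point above is largely about consistency: randomizing each attribute independently produces combinations no real device has. Here is a toy sketch of the profile-based approach such tools take (the profile data below is hypothetical illustration, not real fingerprint data):

```python
import random

# hypothetical profiles: each is a coherent real-world combination, so no
# attribute contradicts another (e.g. a macOS user agent with a Win32 platform)
PROFILES = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
     "platform": "Win32", "screen": (1920, 1080)},
    {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
     "platform": "MacIntel", "screen": (1440, 900)},
]

def pick_profile(rng):
    # pick an entire profile at once rather than mixing attributes randomly
    return rng.choice(PROFILES)

profile = pick_profile(random.Random())
print(profile["platform"], profile["screen"])
```

Anti-detect browsers apply the same idea at a much deeper level, patching the values reported by JavaScript APIs, WebGL, and Canvas so they all describe the same plausible device.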
| Tool | Base Browser | Approach |
| --- | --- | --- |
| nodriver | Chrome | Direct CDP without WebDriver, no automation flags |
| camoufox | Firefox | C++-level browser patches, realistic fingerprint generation |
| pydoll | Chrome | CDP-native with network interception and event-driven design |

For most anti-detection needs, nodriver (Chrome-based) or camoufox (Firefox-based) are the recommended starting points.

AI-Powered Browser Automation

A new paradigm in browser automation uses large language models (LLMs) to control browsers through natural language instructions instead of explicit selectors and scripted interactions.

Instead of writing:

```python
page.click("button.submit-btn")
page.fill("input#email", "user@example.com")
```

You write:

```python
agent.run("Fill in the email field and click submit")
```

The AI agent analyzes the page (via DOM inspection and/or screenshots) and determines which elements to interact with. This approach is particularly useful for:

  • Diverse page layouts - When scraping many different sites where selectors vary
  • Frequently changing UIs - When sites update their HTML structure regularly
  • Complex workflows - Multi-step processes that are tedious to script explicitly
  • Rapid prototyping - Getting a scraper working quickly without studying page structure
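Internally, most of these agents run an observe-decide-act loop: show the model the task and the current page state, get one action back, execute it, and repeat until the model says it is done. A stubbed sketch of that loop (the `run_agent` function and action strings are hypothetical, not any specific library's API):

```python
def run_agent(task, page_html, llm, max_steps=5):
    """Toy observe-decide-act loop: ask the model for one action at a time."""
    actions = []
    for _ in range(max_steps):
        prompt = f"Task: {task}\nPage: {page_html}\nNext action (or DONE):"
        action = llm(prompt)
        if action == "DONE":
            break
        actions.append(action)  # a real agent would execute this in the browser
    return actions

# stub model that fills the email field, clicks submit, then stops
script = iter(['fill input#email "user@example.com"', "click button.submit-btn", "DONE"])
steps = run_agent("Fill in the email field and click submit",
                  "<form>...</form>", lambda _: next(script))
print(steps)
```

The `max_steps` cap matters in practice: without it, a confused model can loop indefinitely, burning API calls on every iteration.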
| Tool | Language | LLM Providers |
| --- | --- | --- |
| browser-use | Python | OpenAI, Anthropic, Google |
| stagehand | NodeJS | OpenAI, Anthropic |
| crawl4ai | Python | OpenAI, Anthropic, local |

Note that AI-powered scraping adds latency and cost (LLM API calls per page) compared to traditional selector-based approaches. It's best used for prototyping or scraping tasks where selector maintenance cost exceeds LLM API cost.

For a full comparison of all browser-related libraries, see the Browser Libraries overview. For ready-to-use scrapers that handle anti-bot bypass, see the Web Scrapers section.
