Browser Automation
An increasingly popular way to scrape the web is to use browser automation tools instead of HTTP clients. Using an entire browser to retrieve the contents of a web page might seem like overkill, but it has several benefits:
Pros
- Browser scrapers are harder to identify and block since they appear like real users.
- Browsers execute JavaScript, which makes scraping dynamic pages and web apps possible without reverse engineering the website.
- Often easier to develop, as we can use human-like instructions: click this button, enter text there.
Cons
- Much more resource intensive and slower
- Harder to scale
- More error prone (browsers are very complicated)
There are three major browser automation toolkits: Playwright, Puppeteer, and Selenium.
Playwright
Playwright is the newest library, with a rapidly growing community. It's available in many languages and offers both asynchronous and synchronous clients. Playwright has the most modern and easy-to-follow programming API of all the browser automation libraries.
Example: Python + Playwright
This example uses Python and Playwright to scrape this page and parsel to parse it:

```python
# Playwright has 2 APIs: a synchronous one:
from playwright.sync_api import sync_playwright
# and an asynchronous one:
# from playwright.async_api import async_playwright
from parsel import Selector

# the synchronous API is used in this snippet:
with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # navigate to the page
    page.goto("https://webscraping.fyi/overview/browser-automation/")
    # wait for the page to load by checking for presence of a loaded element:
    page.wait_for_selector("h2#playwright")
    # then we can retrieve the page source and parse it
    html = page.content()
    selector = Selector(text=html)
    this_snippet = ''.join(selector.xpath(
        "//h2[@id='playwright']/following-sibling::details[1]//text()"
    ).getall())
    print(this_snippet)
```
Puppeteer
Puppeteer is only available in JavaScript (NodeJS), so while it's less accessible, it has a bigger and older community than Playwright, which means more web scraping resources and extensions are built around it.
Example: NodeJS + Puppeteer
This example uses NodeJS and Puppeteer to scrape this page and cheerio to parse it:

```javascript
// import puppeteer and cheerio
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrape() {
  // launch puppeteer browser
  const browser = await puppeteer.launch({headless: false});
  let page = await browser.newPage();
  await page.goto('https://webscraping.fyi/overview/browser-automation/', {
    waitUntil: 'domcontentloaded',
  });
  // wait for the page to load
  await page.waitForSelector('h2#selenium', {timeout: 5_000});

  // parse data with cheerio
  const html = await page.content();
  const tree = cheerio.load(html);
  const thisSnippet = tree('h2#puppeteer ~ details').text();
  console.log(thisSnippet);

  // close everything
  await page.close();
  await browser.close();
}
scrape();
```
Selenium
Selenium is the oldest and the most mature browser automation toolkit. Maturity is key here as there are a lot of free educational resources online and open source scrapers based on Selenium.
Example: Python + Selenium
This example uses Python and Selenium to scrape this page and parsel to parse it:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

# here we can configure the selenium webdriver
options = Options()
options.headless = True  # run browser in the background
options.add_argument("start-maximized")  # ensure window is maximized
options.add_argument("--window-size=1920,1080")  # common window resolution to prevent blocking

driver = webdriver.Chrome(options=options)
driver.get("https://webscraping.fyi/overview/browser-automation/")

# wait for the page to load by checking for presence of a loaded element:
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h2#selenium'))
)

# then we can retrieve the page source and parse it
html = driver.page_source
selector = Selector(text=html)
this_snippet = ''.join(selector.xpath(
    "//h2[@id='selenium']/following-sibling::details[1]//text()"
).getall())
print(this_snippet)
```
Comparison
Generally, Playwright is the most feature-rich and modern toolkit; however, Puppeteer has a more mature community around it, so finding web scraping extensions and resources is much easier.
| Feature | 🥇Playwright | 🥈Puppeteer | 🥉Selenium |
|---|---|---|---|
| Languages | Python, NodeJS, Java, .NET | NodeJS | Java, Python, C#, Ruby, NodeJS, Kotlin |
| Browsers | Chrome, Firefox, Safari | Chrome, Firefox | Chrome, Firefox, Safari |
| Async | ✅ | ✅ | ❌ |
Chrome DevTools Protocol
CDP is the standard protocol these libraries use to control the web browser.
In short, browsers like Chrome or Firefox can be launched with an open websocket connection that can be used to control the browser. This means these three tools are not the only libraries for browser automation, and there are many more CDP clients in varying states of completeness.
Note that for web scraping we often only need basic browser functionality, so alternative CDP clients can be viable even if they don't implement all of the CDP features.
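To make the protocol concrete: each CDP command is a small JSON message with an `id`, a `method`, and `params`, sent over the browser's websocket endpoint (`Page.navigate` is a real CDP method; the websocket transport itself is omitted in this sketch):

```python
import itertools
import json

_ids = itertools.count(1)

def cdp_command(method, **params):
    """Build a CDP command message as it would be sent over the browser's websocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# e.g. instruct the browser to load a page:
msg = cdp_command("Page.navigate", url="https://webscraping.fyi/")
print(msg)
```

Any client that can open a websocket and exchange messages in this shape can drive the browser, which is why so many alternative CDP clients exist.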
Anti-Detect Browsers
A growing category of browser automation tools focuses specifically on evading bot detection systems. Unlike standard browser automation which aims to control the browser, anti-detect tools aim to make the automation invisible to anti-bot systems like Cloudflare, DataDome, and PerimeterX.
These tools address the main detection vectors:
- WebDriver flag - Standard tools like Selenium set `navigator.webdriver = true`, which is trivially detectable. Anti-detect browsers remove or mask this.
- CDP detection - Some anti-bot systems can detect Chrome DevTools Protocol connections. Anti-detect tools use patched browsers that hide these signals.
- Browser fingerprinting - Anti-detect browsers generate realistic fingerprints (WebGL, Canvas, fonts, screen resolution) that match real user configurations.
- TLS/HTTP fingerprinting - The TLS handshake and HTTP header ordering can identify automated tools. Some anti-detect solutions patch these at the network level.
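As a toy illustration of the header-ordering vector above, a server-side check might reduce the order of header names to a short hash; real browsers send headers in a characteristic order that naive scripts rarely reproduce. This is a hypothetical sketch, not any real anti-bot product's algorithm:

```python
import hashlib

def header_order_fingerprint(header_names):
    """Hash the order of HTTP header names into a short fingerprint."""
    joined = ",".join(name.lower() for name in header_names)
    return hashlib.sha256(joined.encode()).hexdigest()[:16]

# a Chrome-like header order vs. a naive script's order:
chrome_order = ["Host", "Connection", "User-Agent", "Accept", "Accept-Encoding"]
script_order = ["Accept-Encoding", "Accept", "Host", "User-Agent", "Connection"]

# same headers, different order -> different fingerprints
print(header_order_fingerprint(chrome_order) != header_order_fingerprint(script_order))  # True
```

TLS fingerprinting (e.g. JA3-style hashing of the ClientHello) works on the same principle, one protocol layer lower.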
| Tool | Base Browser | Approach |
|---|---|---|
| nodriver | Chrome | Direct CDP without WebDriver, no automation flags |
| camoufox | Firefox | C++-level browser patches, realistic fingerprint generation |
| pydoll | Chrome | CDP-native with network interception and event-driven design |
For most anti-detection needs, nodriver (Chrome-based) or camoufox (Firefox-based) are the recommended starting points.
AI-Powered Browser Automation
A new paradigm in browser automation uses large language models (LLMs) to control browsers through natural language instructions instead of explicit selectors and scripted interactions.
Instead of writing:
```python
page.click("button.submit-btn")
page.fill("input#email", "user@example.com")
```
You write:
```python
agent.run("Fill in the email field and click submit")
```
The AI agent analyzes the page (via DOM inspection and/or screenshots) and determines which elements to interact with. This approach is particularly useful for:
- Diverse page layouts - When scraping many different sites where selectors vary
- Frequently changing UIs - When sites update their HTML structure regularly
- Complex workflows - Multi-step processes that are tedious to script explicitly
- Rapid prototyping - Getting a scraper working quickly without studying page structure
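The control loop behind these tools can be sketched as: snapshot the page, ask the LLM for the next action, execute it, repeat until done. Here is a minimal toy version with a stubbed page and LLM; all names are hypothetical and do not reflect the actual APIs of browser-use, stagehand, or crawl4ai:

```python
def run_agent(task, page, llm, max_steps=5):
    """Toy agent loop: ask the LLM for one action at a time until it says done."""
    history = []
    for _ in range(max_steps):
        prompt = f"Task: {task}\nPage: {page.snapshot()}\nHistory: {history}"
        action = llm(prompt)  # e.g. {"op": "click", "selector": "..."} or {"op": "done"}
        if action["op"] == "done":
            break
        page.execute(action)
        history.append(action)
    return history

# stub page and LLM to show the flow without a browser or API calls:
class StubPage:
    def snapshot(self):
        return "<button class='submit-btn'>Submit</button>"
    def execute(self, action):
        pass

def stub_llm(prompt):
    # a real LLM would pick the selector from the page snapshot in the prompt
    if "History: []" in prompt:
        return {"op": "click", "selector": "button.submit-btn"}
    return {"op": "done"}

steps = run_agent("Click submit", StubPage(), stub_llm)
print(steps)  # [{'op': 'click', 'selector': 'button.submit-btn'}]
```

Each iteration costs one LLM call, which is where the latency and per-page cost mentioned below comes from.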
| Tool | Language | LLM Providers |
|---|---|---|
| browser-use | Python | OpenAI, Anthropic, Google |
| stagehand | NodeJS | OpenAI, Anthropic |
| crawl4ai | Python | OpenAI, Anthropic, local |
Note that AI-powered scraping adds latency and cost (LLM API calls per page) compared to traditional selector-based approaches. It's best used for prototyping or scraping tasks where selector maintenance cost exceeds LLM API cost.
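That trade-off can be made concrete with a back-of-envelope calculation; all prices and hours below are made-up placeholders, not measured figures:

```python
def breakeven_pages(selector_dev_hours, hourly_rate, llm_cost_per_page):
    """Pages after which hand-written selectors beat per-page LLM calls."""
    return (selector_dev_hours * hourly_rate) / llm_cost_per_page

# e.g. 4 hours of selector work at $50/h vs $0.02 of LLM tokens per page:
print(breakeven_pages(4, 50, 0.02))  # roughly 10,000 pages
```

Below the break-even volume the LLM approach is cheaper overall; above it, hand-written selectors win, which matches the prototyping-vs-production split described above.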
For a full comparison of all browser-related libraries, see the Browser Libraries overview. For ready-to-use scrapers that handle anti-bot bypass, see the Web Scrapers section.