Browser Automation

An increasingly popular way to scrape the web is to use browser automation tools instead of HTTP clients. Using an entire browser to retrieve the contents of a web page might seem like overkill, but it has several benefits:

Pros 👍

  • Browser scrapers are harder to identify and block since they look like real users.
  • Browsers execute JavaScript, which makes scraping dynamic pages and web apps possible without reverse engineering the website.
  • Often easier to develop, as we can use human-like instructions: click this button, enter text there.

Cons 👎

  • Much more resource-intensive and slower.
  • Harder to scale.
  • More error-prone (browsers are very complicated).

There are three major browser automation toolkits: Playwright, Puppeteer and Selenium.

Playwright

Playwright is the newest library, with a rapidly growing community. It's available in many languages and offers both asynchronous and synchronous clients. Playwright has the most modern and easiest-to-follow programming API of the three.

Example: Python + Playwright

This example uses Python and Playwright to scrape this page and parsel to parse it:

# Playwright ships two APIs: a synchronous one:
from playwright.sync_api import sync_playwright
# and an asynchronous one:
from playwright.async_api import async_playwright
from parsel import Selector

# the synchronous API is used in this snippet:
with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # navigate to the page
    page.goto("https://webscraping.fyi/overview/browser-automation/")
    # wait for the page to load by checking for the presence of a loaded element:
    page.wait_for_selector("h2#playwright")
    # then we can retrieve the page source and parse it
    html = page.content()
    selector = Selector(text=html)
    this_snippet = ''.join(selector.xpath("//h2[@id='playwright']/following-sibling::details[1]//text()").getall())
    print(this_snippet)
    browser.close()
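All three examples in this article rely on the same parsing pattern: locate the section heading, then take the first details element that follows it (that's what the XPath `following-sibling::details[1]` expresses). As a rough, dependency-free illustration of that pattern — the HTML below is made up for demonstration — Python's standard library alone can express it:

```python
from html.parser import HTMLParser

# Collect the text of the first <details> element that appears
# after an <h2> with a given id — the same "heading, then its
# details sibling" pattern the XPath in the snippets above uses.
class SnippetExtractor(HTMLParser):
    def __init__(self, heading_id):
        super().__init__()
        self.heading_id = heading_id
        self.after_heading = False  # have we passed the target <h2>?
        self.in_details = False     # are we inside the target <details>?
        self.done = False           # did we finish the first <details>?
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and dict(attrs).get("id") == self.heading_id:
            self.after_heading = True
        elif tag == "details" and self.after_heading and not self.done:
            self.in_details = True

    def handle_endtag(self, tag):
        if tag == "details" and self.in_details:
            self.in_details = False
            self.done = True  # only the first <details> counts

    def handle_data(self, data):
        if self.in_details:
            self.text.append(data)

doc = """
<h2 id="other">Other</h2><details>skip me</details>
<h2 id="playwright">Playwright</h2><details>example code here</details>
"""
parser = SnippetExtractor("playwright")
parser.feed(doc)
snippet = "".join(parser.text)
print(snippet)  # → example code here
```

In the real snippets, parsel's XPath (or cheerio's CSS selectors) does this matching far more conveniently; the sketch just shows what the selector is actually asking for.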

Puppeteer

Puppeteer is only available in JavaScript (NodeJS), so while it's less accessible, it has a bigger and older community than Playwright. Being more mature also means a bigger web scraping community around it.

Example: NodeJS + Puppeteer

This example uses NodeJS and Puppeteer to scrape this page and cheerio to parse it:

// import puppeteer and cheerio
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrape(){
    // launch puppeteer browser
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.goto('https://webscraping.fyi/overview/browser-automation/', {
        waitUntil: 'domcontentloaded',
    });
    // wait for the page to load by checking for the presence of a loaded element:
    await page.waitForSelector('h2#selenium', {timeout: 5_000});

    // parse data with cheerio
    const html = await page.content();
    const tree = cheerio.load(html);
    // take only the first <details> sibling following the heading
    const thisSnippet = tree('h2#puppeteer ~ details').first().text();
    console.log(thisSnippet);

    // close everything
    await page.close();
    await browser.close();
}

scrape();

Selenium

Selenium is the oldest and the most mature browser automation toolkit. Maturity is key here as there are a lot of free educational resources online and open source scrapers based on Selenium.

Example: Python + Selenium

This example uses Python and Selenium to scrape this page and parsel to parse it:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

# here we can configure the selenium webdriver
options = Options()
options.add_argument("--headless=new")  # run browser in the background
options.add_argument("start-maximized")  # ensure the window is maximized
options.add_argument("--window-size=1920,1080")  # common window resolution to prevent blocking

driver = webdriver.Chrome(options=options)
driver.get("https://webscraping.fyi/overview/browser-automation/")
# wait for page to load by checking for presence of a loaded element:
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h2#selenium'))
)

# then we can retrieve the page source and parse it
html = driver.page_source
selector = Selector(text=html)
this_snippet = ''.join(selector.xpath("//h2[@id='selenium']/following-sibling::details[1]//text()").getall())
print(this_snippet)

Comparison

Generally, Playwright is the most feature-rich and modern toolkit; however, Puppeteer has a more mature community around it, so finding web scraping extensions and resources for it is much easier.

Feature    🥇 Playwright               🥈 Puppeteer     🥉 Selenium
Languages  Python, NodeJS, Java, .NET  NodeJS           Java, Python, C#, Ruby, NodeJS, Kotlin
Browsers   Chrome, Firefox, Safari     Chrome, Firefox  Chrome, Firefox, Safari
Async      yes                         yes              no

Chrome Devtools Protocol

CDP (the Chrome Devtools Protocol) is the standard way these libraries control the web browser.

In short, browsers like Chrome or Firefox can be launched with an open web socket connection which is then used to control the browser. This means these three tools are not the only options for browser automation: there are many more CDP clients of varying completeness.
Note that for web scraping we often only need basic browser functionality, so alternative CDP clients can be viable even if they don't implement the full protocol.
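Under the hood, CDP is a JSON message protocol sent over that web socket: each command carries an id (used to match its reply), a "Domain.method" name, and parameters. A rough sketch of the message shapes — the reply values below are made up for illustration, and a real session would get its socket URL from the browser's debugging endpoint:

```python
import json

# A CDP command is a JSON object with an id, a "Domain.method"
# name, and that method's parameters:
command = {
    "id": 1,
    "method": "Page.navigate",
    "params": {"url": "https://webscraping.fyi/overview/browser-automation/"},
}
# this string is what actually travels over the web socket:
wire_message = json.dumps(command)
print(wire_message)

# The browser answers on the same socket; replies are matched
# back to commands by id (result fields here are illustrative):
reply = json.loads('{"id": 1, "result": {"frameId": "F1", "loaderId": "L1"}}')
assert reply["id"] == command["id"]
```

Every library in this article ultimately speaks messages of this shape (or, in Selenium's case, the similar WebDriver wire protocol), which is why lighter-weight CDP clients can stand in for a full toolkit when only basic navigation and content retrieval are needed.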