Browser Automation
An increasingly popular way to scrape the web is to use browser automation tools instead of HTTP clients. Using an entire browser just to retrieve the contents of a web page might seem like overkill, but it has several benefits:
Pros
- Browser scrapers are harder to identify and block since they appear like real users.
- Browsers execute JavaScript, which makes scraping dynamic pages and web apps possible without reverse engineering the website.
- Often easier to develop, as we can use human-like instructions: click this button, enter text there (see the sketch below).
Cons
- Much more resource intensive and slower
- Harder to scale
- More error prone (browsers are very complicated)
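As an example of the "human-like instructions" point above, here is roughly what that style of scripting looks like in Playwright (introduced below). This is just a sketch: the URL and form selectors are hypothetical.

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")  # hypothetical page
    # human-like instructions: enter text there, click button here
    page.fill("input[name=username]", "user")    # hypothetical selectors
    page.fill("input[name=password]", "secret")
    page.click("button[type=submit]")
    browser.close()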
There are three major browser automation toolkits: Playwright, Puppeteer, and Selenium.
Playwright
Playwright is the newest library, with a rapidly growing community. It's available in several languages and features both asynchronous and synchronous clients. Playwright has the most modern and easiest-to-follow programming API of all the browser automation libraries.
Example: Python + Playwright
This example uses Python and Playwright to scrape this page and parsel to parse it:
# Playwright has 2 APIs: a synchronous one:
from playwright.sync_api import sync_playwright
# and an asynchronous one:
from playwright.async_api import async_playwright
from parsel import Selector

# the synchronous API is used in this snippet:
with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # navigate to the page
    page.goto("https://webscraping.fyi/overview/browser-automation/")
    # wait for the page to load by checking for the presence of a loaded element:
    page.wait_for_selector("h2#playwright")
    # then we can retrieve the page source and parse it
    html = page.content()
    selector = Selector(text=html)
    this_snippet = ''.join(selector.xpath("//h2[@id='playwright']/following-sibling::details[1]//text()").getall())
    print(this_snippet)
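Since Playwright also ships an asynchronous client, the same scrape can be written with async/await, which is useful when driving many pages concurrently. A minimal async sketch of the same scrape (same URL and selector as above):

import asyncio
from playwright.async_api import async_playwright
from parsel import Selector

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://webscraping.fyi/overview/browser-automation/")
        await page.wait_for_selector("h2#playwright")
        # parsing with parsel works exactly as in the synchronous version
        html = await page.content()
        selector = Selector(text=html)
        print(''.join(selector.xpath("//h2[@id='playwright']/following-sibling::details[1]//text()").getall()))
        await browser.close()

asyncio.run(main())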
Puppeteer
Puppeteer is only available in JavaScript (NodeJS), so while it's less accessible, it has a bigger and older community than Playwright, which means more web scraping resources and extensions are available for it.
Example: NodeJS + Puppeteer
This example uses NodeJS and Puppeteer to scrape this page and cheerio to parse it:
// import puppeteer and cheerio
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrape() {
  // launch a puppeteer browser (set headless: true to run it in the background)
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://webscraping.fyi/overview/browser-automation/', {
    waitUntil: 'domcontentloaded',
  });
  // wait for the page to load by waiting for a heading further down the page
  await page.waitForSelector('h2#selenium', {timeout: 5_000});
  // parse data with cheerio
  const html = await page.content();
  const tree = cheerio.load(html);
  const thisSnippet = tree('h2#puppeteer ~ details').text();
  console.log(thisSnippet);
  // close everything
  await page.close();
  await browser.close();
}
scrape();
Selenium
Selenium is the oldest and most mature browser automation toolkit. Maturity is key here, as there are plenty of free educational resources online as well as open-source scrapers based on Selenium.
Example: Python + Selenium
This example uses Python and Selenium to scrape this page and parsel to parse it:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

# here we can configure the selenium webdriver
options = Options()
options.add_argument("--headless=new")  # run browser in the background
options.add_argument("start-maximized")  # ensure window is maximized
options.add_argument("--window-size=1920,1080")  # common window resolution to prevent blocking

driver = webdriver.Chrome(options=options)
driver.get("https://webscraping.fyi/overview/browser-automation/")
# wait for the page to load by checking for the presence of a loaded element:
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h2#selenium"))
)
# then we can retrieve the page source and parse it
html = driver.page_source
selector = Selector(text=html)
this_snippet = ''.join(selector.xpath("//h2[@id='selenium']/following-sibling::details[1]//text()").getall())
print(this_snippet)
driver.quit()  # close the browser when done
Comparison
Generally, Playwright is the most feature-rich and modern toolkit; however, Puppeteer has a more mature community around it, so finding web scraping extensions and resources for it is much easier.
Feature | 🥇 Playwright | 🥈 Puppeteer | 🥉 Selenium |
---|---|---|---|
Languages | Python, NodeJS, Java, .NET | NodeJS | Java, Python, C#, Ruby, NodeJS, Kotlin |
Browsers | Chrome, Firefox, Safari | Chrome, Firefox | Chrome, Firefox, Safari |
Async | ✅ | ✅ | ❌ |
Chrome DevTools Protocol
The Chrome DevTools Protocol (CDP) is the standard way these libraries control the web browser.
In short, browsers like Chrome or Firefox can be launched with an open WebSocket connection which can be used to control the browser. This means these three tools are not the only libraries for browser automation, and there are many more CDP clients of varying completeness.
Note that for web scraping we often only need basic browser functionality, so alternative CDP clients can be viable even if they don't implement all of the CDP functionality.
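To make the WebSocket mechanism concrete, here is a minimal sketch of speaking raw CDP from Python. It assumes Chrome is installed as google-chrome, that port 9222 is free, and that the third-party websockets package is available; the two-second sleep is a crude stand-in for properly polling the debugging endpoint:

import asyncio
import json
import subprocess
import urllib.request

import websockets  # third-party websocket client

async def main():
    # launch Chrome with a remote debugging port open
    chrome = subprocess.Popen([
        "google-chrome",  # assumed binary name; adjust for your system
        "--headless=new",
        "--remote-debugging-port=9222",
    ])
    await asyncio.sleep(2)  # crude wait for the debugging endpoint to come up
    # the browser exposes its control WebSocket URL over HTTP:
    with urllib.request.urlopen("http://localhost:9222/json/version") as resp:
        ws_url = json.load(resp)["webSocketDebuggerUrl"]
    # connect and send a raw CDP command
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps({"id": 1, "method": "Browser.getVersion"}))
        print(json.loads(await ws.recv()))
    chrome.terminate()

asyncio.run(main())

This is essentially what Playwright and Puppeteer do under the hood, wrapped in a far more ergonomic API.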