puppeteer-stealthvscurl-cffi
Puppeteer Stealth is puppeteer plugin that fortifies headles browser for web scraping. This makes detection of puppeteer scrapers more difficult allowing to scrape targets which use headless browser detection techniques.
Puppeteer-stealth does this by applying various javascript patches to cover up traces of headless browser presence in the web scraping browser's environment.
Curl-cffi is a Python library for implementing curl-impersonate which is a
HTTP client that appears as one of popular web browsers like:
- Google Chrome
- Microsoft Edge
- Safari
- Firefox
Unlike requests
and httpx
which are native Python libraries, curl-cffi
uses cURL and inherits it's powerful features
like extensive HTTP protocol support and detection patches for TLS and HTTP fingerprinting.
Using curl-cffi web scrapers can bypass TLS and HTTP fingerprinting.
Highlights
Example Use
const puppeteer = require('puppeteer-extra')
// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
console.log('Running tests..')
const page = await browser.newPage()
await page.goto('https://bot.sannysoft.com')
await page.waitForTimeout(5000)
await page.screenshot({ path: 'result.png', fullPage: true })
await browser.close()
console.log("success - check the result.png screenshot")
})
from curl_cffi import requests
response = requests.get('https://httpbin.org/json')
print(response.json())
# or using sessions
session = requests.Session()
response = session.get('https://httpbin.org/json')
# also supports async requests using asyncio
import asyncio
from curl_cffi.requests import AsyncSession
urls = [
"http://httpbin.org/html",
"http://httpbin.org/html",
"http://httpbin.org/html",
]
async with AsyncSession() as s:
tasks = []
for url in urls:
task = s.get(url)
tasks.append(task)
# scrape concurrently:
responses = await asyncio.gather(*tasks)
# also supports websocket connections
from curl_cffi.requests import Session, WebSocket
def on_message(ws: WebSocket, message):
print(message)
with Session() as s:
ws = s.ws_connect(
"wss://api.gemini.com/v1/marketdata/BTCUSD",
on_message=on_message,
)
ws.run_forever()