puppeteer-stealth vs hrequests
Puppeteer Stealth is a Puppeteer plugin that hardens headless browsers for web scraping. It makes Puppeteer-based scrapers harder to detect, which allows scraping targets that use headless-browser detection techniques.
Puppeteer-stealth does this by applying various JavaScript patches that cover up traces of the headless browser in the scraping browser's environment.
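To make "traces" concrete, here is a minimal Python sketch of one such trace and the kind of rewrite a stealth patch performs. It assumes a naive server-side check on the User-Agent string, since headless Chrome has historically identified itself as `HeadlessChrome`. The function names `looks_headless` and `patch_user_agent` and the sample User-Agent are hypothetical, and real detection scripts inspect many more signals than this one.

```python
def looks_headless(user_agent: str) -> bool:
    """Naive check a site might run: flag User-Agents containing 'HeadlessChrome'."""
    return "headlesschrome" in user_agent.lower()


def patch_user_agent(user_agent: str) -> str:
    """Illustrative stand-in for a stealth patch: drop the 'Headless' marker."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Sample User-Agent in the style headless Chrome has historically sent (assumed value).
default_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/122.0.0.0 Safari/537.36"
)
patched_ua = patch_user_agent(default_ua)

print(looks_headless(default_ua))   # the unpatched string trips the check
print(looks_headless(patched_ua))   # the patched string does not
```

The plugin itself applies its patches inside the browser, in JavaScript. The sketch only shows why removing a single marker can change the outcome of a simple check.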
hrequests is a feature-rich, modern replacement for the well-known requests library for Python. It provides an HTTP client capable of resisting popular scraper identification techniques:
- Seamless transition between headless browser and HTTP client based requests
- Integrated HTML parser
- Mimicking of real browser TLS fingerprints
- JavaScript rendering
- HTTP/2 support
- Realistic browser headers
Example Use
// puppeteer-stealth (JavaScript)
const puppeteer = require('puppeteer-extra')

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  // wait 5 seconds for the detection tests to finish
  // (page.waitForTimeout is not available in newer Puppeteer releases)
  await new Promise(resolve => setTimeout(resolve, 5000))
  await page.screenshot({ path: 'result.png', fullPage: true })
  await browser.close()
  console.log('success - check the result.png screenshot')
})
# hrequests (Python)
import hrequests

# perform HTTP client requests
resp = hrequests.get('https://httpbin.org/html')
print(resp.status_code)
# 200

# use headless browsers and sessions:
session = hrequests.Session('chrome', version=122, os="mac")

# supports asyncio and easy concurrency
# (named reqs to avoid confusion with the requests library)
reqs = [
    hrequests.async_get('https://www.google.com/', browser='firefox'),
    hrequests.async_get('https://www.duckduckgo.com/'),
    hrequests.async_get('https://www.yahoo.com/'),
    hrequests.async_get('https://www.httpbin.org/'),
]
responses = hrequests.map(reqs, size=3)  # at most 3 concurrent requests
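The `size=3` argument caps how many requests are in flight at once. Below is a minimal sketch of that bounded-pool idea using only the Python standard library. It assumes a hypothetical `fake_fetch` function that sleeps instead of doing network I/O, and it illustrates the pattern only; it is not how hrequests implements `map` internally.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()
_active = 0        # number of fake requests currently running
peak_active = 0    # highest number observed running at the same time


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a network request; sleeps instead of doing I/O."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.05)  # simulate network latency
    with _lock:
        _active -= 1
    return f"fetched {url}"


urls = [
    "https://www.google.com/",
    "https://www.duckduckgo.com/",
    "https://www.yahoo.com/",
    "https://www.httpbin.org/",
]

# max_workers=3 plays the role of size=3: at most three fetches run at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, urls))  # results keep the input order

print(results)
print("peak concurrency:", peak_active)
```

With four URLs and a pool of three, the fourth fetch starts only after one of the first three finishes, which is the behavior a concurrency cap is meant to provide.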