botasaurus
Botasaurus is an all-in-one Python web scraping framework that combines browser automation, anti-detection, and scaling features into a single package. It aims to simplify the entire web scraping workflow from development to deployment.
Key features include:
- Anti-detect browser: ships with a stealth-patched browser that passes common bot-detection tests, automatically handling fingerprinting, user-agent rotation, and other anti-detection measures.
- Decorator-based API: uses Python decorators (@browser, @request) to define scraping tasks, keeping code clean and easy to organize.
- Built-in parallelism: runs scraping tasks in parallel across multiple browser instances with configurable concurrency.
- Caching: a built-in caching layer avoids re-scraping pages during development and debugging.
- Profile persistence: saves and reuses browser profiles (cookies, localStorage) across scraping sessions to maintain login state.
- Output handling: writes results to JSON, CSV, or custom formats, with built-in data filtering.
- Web dashboard: includes a web UI for monitoring scraping progress, viewing results, and managing tasks.
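To illustrate what the caching layer buys you, here is a minimal in-memory sketch of a cache-aware task decorator. The `cached_task` helper is hypothetical, written only to show the idea; Botasaurus itself persists its cache to disk rather than memory:

```python
import functools

# Minimal stand-in for a cache=True decorator option: memoize results
# per input URL so repeated runs skip the expensive fetch.
def cached_task(func):
    store = {}

    @functools.wraps(func)
    def wrapper(url):
        if url not in store:          # only "scrape" on a cache miss
            store[url] = func(url)
        return store[url]

    wrapper.cache = store
    return wrapper

calls = []

@cached_task
def scrape(url):
    calls.append(url)                 # stand-in for the real network fetch
    return {"url": url, "status": "ok"}

scrape("https://example.com/page/1")
scrape("https://example.com/page/1")  # second call served from cache
```

During development this means a crashed run can be restarted without re-fetching pages that already succeeded.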
Botasaurus is designed for developers who want a batteries-included framework that handles anti-detection automatically, without needing to manually configure stealth settings or manage browser fingerprints.
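The profile-persistence idea can be sketched with the standard library alone: save session state when a run ends, reload it when the next run starts. The `save_profile`/`load_profile` helpers below are hypothetical illustrations, not Botasaurus APIs:

```python
import json
import os
import tempfile

# Sketch of profile persistence: serialize session state (cookies,
# tokens) to disk so the next run can resume logged in.
def save_profile(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_profile(path):
    if not os.path.exists(path):
        return {}                      # first run: no saved state yet
    with open(path) as f:
        return json.load(f)

profile_path = os.path.join(tempfile.mkdtemp(), "profile.json")
save_profile(profile_path, {"session_id": "abc123"})
restored = load_profile(profile_path)  # next session starts with this state
```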
Example Use
```python
from botasaurus.browser import browser, Driver
from botasaurus.request import request, Request

# Browser-based scraping with anti-detection
@browser(parallel=3, cache=True)
def scrape_products(driver: Driver, url: str):
    driver.get(url)

    # Wait for content to load
    driver.wait_for_element(".product-list")

    # Extract product data
    products = []
    for el in driver.select_all(".product-card"):
        products.append({
            "name": el.select(".product-name").text,
            "price": el.select(".product-price").text,
            "url": el.select("a").get_attribute("href"),
        })
    return products

# HTTP-based scraping (no browser needed)
@request(parallel=5, cache=True)
def scrape_api(req: Request, url: str):
    response = req.get(url)
    return response.json()

# Run the scraper
results = scrape_products(
    ["https://example.com/page/1", "https://example.com/page/2"]
)
```
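Once the scraper returns, shaping its results for output takes only the standard library. This sketch assumes `results` comes back as one list of product dicts per input URL (sample data below, since the example URLs are placeholders):

```python
import csv
import io
import json

# Sample of the assumed return shape: one list of product dicts per page
results = [
    [{"name": "Widget", "price": "$9.99", "url": "https://example.com/w"}],
    [{"name": "Gadget", "price": "$19.99", "url": "https://example.com/g"}],
]

# Flatten the per-page lists into a single product list
products = [item for page in results for item in page]

# Serialize to JSON
json_blob = json.dumps(products, indent=2)

# Serialize to CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price", "url"])
writer.writeheader()
writer.writerows(products)
csv_blob = buf.getvalue()
```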