Botasaurus is an all-in-one Python web scraping framework that combines browser automation,
anti-detection, and scaling features into a single package. It aims to simplify the entire
web scraping workflow from development to deployment.
Key features include:

- **Anti-detect browser**: ships with a stealth-patched browser that passes common bot detection tests, automatically handling fingerprinting, user-agent rotation, and other anti-detection measures.
- **Decorator-based API**: uses Python decorators (`@browser`, `@request`) to define scraping tasks, keeping code clean and easy to organize.
- **Built-in parallelism**: runs scraping tasks in parallel across multiple browser instances with configurable concurrency.
- **Caching**: a built-in caching layer avoids re-scraping pages during development and debugging.
- **Profile persistence**: saves and reuses browser profiles (cookies, localStorage) across scraping sessions to maintain login state.
- **Output handling**: automatically writes results to JSON, CSV, or custom formats, with built-in data filtering.
- **Web dashboard**: includes a web UI for monitoring scraping progress, viewing results, and managing tasks.
Botasaurus is designed for developers who want a batteries-included framework that handles
anti-detection automatically, without needing to manually configure stealth settings or
manage browser fingerprints.
Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.
Ruia is inspired by Scrapy, but instead of Twisted it is built entirely on asyncio and aiohttp.
It also supports features such as cookies, custom headers, and proxies, which makes it useful for complex web scraping tasks.
```python
from botasaurus.browser import browser, Driver
from botasaurus.request import request, Request

# Browser-based scraping with anti-detection
@browser(parallel=3, cache=True)
def scrape_products(driver: Driver, url: str):
    driver.get(url)
    # Wait for content to load
    driver.wait_for_element(".product-list")
    # Extract product data
    products = []
    for el in driver.select_all(".product-card"):
        products.append({
            "name": el.select(".product-name").text,
            "price": el.select(".product-price").text,
            "url": el.select("a").get_attribute("href"),
        })
    return products

# HTTP-based scraping (no browser needed)
@request(parallel=5, cache=True)
def scrape_api(req: Request, url: str):
    response = req.get(url)
    return response.json()

# Run the scraper over a list of URLs in parallel
results = scrape_products(
    ["https://example.com/page/1", "https://example.com/page/2"]
)
```
```python
#!/usr/bin/env python
"""
Target: https://news.ycombinator.com/
pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
```