Botasaurus is an all-in-one Python web scraping framework that combines browser automation,
anti-detection, and scaling features into a single package. It aims to simplify the entire
web scraping workflow from development to deployment.
Key features include:
- Anti-detect browser:
  Ships with a stealth-patched browser that passes common bot detection tests. Automatically
  handles fingerprinting, user agent rotation, and other anti-detection measures.
- Decorator-based API:
  Uses Python decorators (`@browser`, `@request`) to define scraping tasks, making code clean
  and easy to organize.
- Built-in parallelism:
  Easy parallel execution of scraping tasks across multiple browser instances with
  configurable concurrency.
- Caching:
  Built-in caching layer to avoid re-scraping pages during development and debugging.
- Profile persistence:
  Can save and reuse browser profiles (cookies, localStorage) across scraping sessions
  for maintaining login state.
- Output handling:
  Automatic output to JSON, CSV, or custom formats with built-in data filtering.
- Web dashboard:
  Includes a web UI for monitoring scraping progress, viewing results, and managing tasks.
Botasaurus is designed for developers who want a batteries-included framework that handles
anti-detection automatically, without needing to manually configure stealth settings or
manage browser fingerprints.
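To make the output handling feature more concrete, here is a minimal standard-library sketch of what filtering results and writing them to JSON and CSV amounts to. It is an illustration only, not Botasaurus's implementation: the helper `write_outputs`, the sample rows, and the file name `scrape_products` are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_outputs(rows, out_dir, name):
    """Write a list of dicts to <name>.json and <name>.csv (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # JSON keeps the records as they are.
    json_path = out_dir / f"{name}.json"
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")

    # CSV uses the keys of the first record as the column order.
    csv_path = out_dir / f"{name}.csv"
    fieldnames = list(rows[0].keys()) if rows else []
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return json_path, csv_path


# Made-up sample data standing in for scraped records.
rows = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.50"},
    {"name": "Broken listing", "price": ""},
]

# A simple filter step before writing: keep only rows that have a price.
filtered = [row for row in rows if row.get("price")]
json_path, csv_path = write_outputs(filtered, tempfile.mkdtemp(), "scrape_products")
```

A framework that does this automatically saves the boilerplate above for every scraper, which is the convenience the feature list describes.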
ralger is a small web scraping framework for R based on rvest and xml2.
Its goal is to simplify basic web scraping through a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables
automatically, and extracting links, titles, images, and paragraphs.
```python
from botasaurus.browser import browser, Driver
from botasaurus.request import request, Request


# Browser-based scraping with anti-detection
@browser(parallel=3, cache=True)
def scrape_products(driver: Driver, url: str):
    driver.get(url)
    # Wait for content to load
    driver.wait_for_element(".product-list")
    # Extract product data
    products = []
    for el in driver.select_all(".product-card"):
        products.append({
            "name": el.select(".product-name").text,
            "price": el.select(".product-price").text,
            "url": el.select("a").get_attribute("href"),
        })
    return products


# HTTP-based scraping (no browser needed)
@request(parallel=5, cache=True)
def scrape_api(req: Request, url: str):
    response = req.get(url)
    return response.json()


# Run the scraper
results = scrape_products(
    ["https://example.com/page/1", "https://example.com/page/2"]
)
```
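The `parallel` and `cache` options in the example above can be understood through a small standard-library sketch. The decorator `task` below is hypothetical and is not how Botasaurus is implemented; it only illustrates the general pattern of a decorator that fans a list of inputs out across workers and skips inputs it has already processed. The cache here lives in memory, whereas a real framework would more likely persist results between runs.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


def task(parallel=1, cache=False):
    """Hypothetical stand-in for a decorator such as @request(parallel=..., cache=...)."""
    def decorator(func):
        store = {}  # in-memory cache keyed by input value

        def run_one(item):
            if cache and item in store:
                return store[item]  # cache hit: skip the real work
            result = func(item)
            if cache:
                store[item] = result
            return result

        @wraps(func)
        def wrapper(data):
            # A single input runs directly; a list is spread across worker threads.
            if not isinstance(data, list):
                return run_one(data)
            with ThreadPoolExecutor(max_workers=parallel) as pool:
                return list(pool.map(run_one, data))  # map preserves input order

        return wrapper
    return decorator


calls = []  # records real invocations so cache hits are visible


@task(parallel=3, cache=True)
def fetch_length(url):
    calls.append(url)
    return len(url)  # placeholder for actual scraping work


urls = ["https://example.com/page/1", "https://example.com/page/2"]
first = fetch_length(urls)   # both URLs are processed
second = fetch_length(urls)  # both results come from the cache
```

After the second call, `calls` still holds only two entries, which is the behavior that makes caching useful while iterating on parsing logic during development.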
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```
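For readers coming from Python, the idea behind automatic table scraping can be sketched with the standard library alone. This is a rough analogue under stated assumptions, not how ralger or rvest work internally: the class `TableParser`, the function `table_to_records`, and the inline sample HTML are all hypothetical, and the sketch treats the first row as the header and ignores nested tables and cells that span several columns.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Turn table rows into a list of dicts, using the first row as column names."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Inline sample so the sketch runs without network access.
sample = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Year</th></tr>
  <tr><td>1</td><td>Avatar</td><td>2009</td></tr>
  <tr><td>2</td><td>Avengers: Endgame</td><td>2019</td></tr>
</table>
"""
records = table_to_records(sample)
```

All values come back as strings here. Libraries built for this job typically also guess column types, as the `<int>` and `<chr>` columns in the tibble above suggest.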