Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript,
built by Apify. It is the successor to the Apify SDK and provides a unified interface for building
reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction
projects.
Crawlee supports multiple crawling strategies through different crawler classes:
- CheerioCrawler: fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
- PlaywrightCrawler: full browser automation via Playwright. Handles JavaScript-rendered pages, SPAs, and complex interactions.
- PuppeteerCrawler: similar to PlaywrightCrawler, but uses Puppeteer as the browser automation backend.
- HttpCrawler: minimal crawler for raw HTTP requests without HTML parsing.
Key features include:
- Automatic request queue management with configurable concurrency and rate limiting
- Built-in proxy rotation with session management
- Persistent request queue and dataset storage (local or cloud via Apify)
- Automatic retry and error handling with configurable strategies
- TypeScript-first design with full type safety
- Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
- Output pipelines for storing extracted data
- Easy deployment to Apify cloud platform
Crawlee is one of the most feature-complete web scraping frameworks in the JavaScript/TypeScript
ecosystem, comparable to Python's Scrapy but with native browser automation support.
ralger is a small web scraping framework for R built on rvest and xml2.
Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, automatic table parsing, and
automatic extraction of links, titles, images, and paragraphs.
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';
// Create a crawler with Playwright for JS rendering
const crawler = new PlaywrightCrawler({
  // Limit concurrency to avoid overwhelming the target
  maxConcurrency: 5,
  // This function is called for each URL
  async requestHandler({ request, page, enqueueLinks }) {
    const title = await page.title();
    // Extract data from the page
    const products = await page.$$eval('.product', (els) =>
      els.map((el) => ({
        name: el.querySelector('.name')?.textContent,
        price: el.querySelector('.price')?.textContent,
      }))
    );
    // Store extracted data
    await Dataset.pushData({
      url: request.url,
      title,
      products,
    });
    // Follow links to crawl more pages
    await enqueueLinks({
      globs: ['https://example.com/products/**'],
    });
  },
});
// Start crawling
await crawler.run(['https://example.com/products']);
```
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```