Crawlee
Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript, built by Apify. It is the successor to the Apify SDK and provides a unified interface for building reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction projects.
Crawlee supports multiple crawling strategies through different crawler classes:
- CheerioCrawler: For fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
- PlaywrightCrawler: Uses Playwright for full browser automation. Handles JavaScript-rendered pages, SPAs, and complex interactions.
- PuppeteerCrawler: Similar to PlaywrightCrawler but uses Puppeteer as the browser automation backend.
- HttpCrawler: Minimal crawler for raw HTTP requests without HTML parsing.
Key features include:
- Automatic request queue management with configurable concurrency and rate limiting
- Built-in proxy rotation with session management
- Persistent request queue and dataset storage (local or cloud via Apify)
- Automatic retry and error handling with configurable strategies
- TypeScript-first design with full type safety
- Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
- Output pipelines for storing extracted data
- Easy deployment to Apify cloud platform
Crawlee is among the most feature-complete web scraping frameworks in the JavaScript/TypeScript ecosystem, often compared to Python's Scrapy, with the added advantage of native browser automation support.
Example Use
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a crawler with Playwright for JS rendering
const crawler = new PlaywrightCrawler({
    // Limit concurrency to avoid overwhelming the target
    maxConcurrency: 5,

    // This function is called for each URL
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();

        // Extract data from the page
        const products = await page.$$eval('.product', (els) =>
            els.map((el) => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }))
        );

        // Store extracted data
        await Dataset.pushData({
            url: request.url,
            title,
            products,
        });

        // Follow links to crawl more pages
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
        });
    },
});

// Start crawling
await crawler.run(['https://example.com/products']);
```