Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript,
built by Apify. It is the successor to the Apify SDK and provides a unified interface for building
reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction
projects.
Crawlee supports multiple crawling strategies through different crawler classes:
- CheerioCrawler: fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
- PlaywrightCrawler: full browser automation via Playwright. Handles JavaScript-rendered pages, SPAs, and complex interactions.
- PuppeteerCrawler: similar to PlaywrightCrawler, but uses Puppeteer as the browser automation backend.
- HttpCrawler: minimal crawler for raw HTTP requests without HTML parsing.
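For static pages, CheerioCrawler is the lightest option of the four. A minimal configuration sketch (the URL is a placeholder, not a real target):

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

// Cheerio parses the fetched HTML directly; no browser is launched.
const crawler = new CheerioCrawler({
  async requestHandler({ request, $ }) {
    // $ is a Cheerio handle over the downloaded page
    await Dataset.pushData({
      url: request.url,
      title: $('title').text(),
    });
  },
});

await crawler.run(['https://example.com']);
```

Because no browser starts up, this variant is typically an order of magnitude faster than the Playwright- or Puppeteer-based crawlers, at the cost of not executing page JavaScript.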
Key features include:
- Automatic request queue management with configurable concurrency and rate limiting
- Built-in proxy rotation with session management
- Persistent request queue and dataset storage (local or cloud via Apify)
- Automatic retry and error handling with configurable strategies
- TypeScript-first design with full type safety
- Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
- Output pipelines for storing extracted data
- Easy deployment to Apify cloud platform
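The retry behavior in the list above can be illustrated with a hand-rolled sketch. This is not Crawlee's internal code, just the general retry-with-exponential-backoff idea that options such as `maxRequestRetries` configure for you:

```javascript
// Sketch of retry with exponential backoff; Crawlee automates this
// via crawler options such as maxRequestRetries (this is NOT Crawlee's code).
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        // Wait baseDelayMs, 2x, 4x, ... before the next attempt
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** attempt)
        );
      }
    }
  }
  throw lastError;
}
```

Crawlee applies this kind of policy per request and, on repeated failure, can also rotate the session and proxy before retrying.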
Crawlee is among the most feature-complete web scraping frameworks in the JavaScript/TypeScript
ecosystem, comparable to Python's Scrapy but with native browser automation support.
Gracy is an API client library based on httpx that provides an extra stability layer with:
- Retry logic
- Logging
- Connection throttling
- Tracking/Middleware
In web scraping, Gracy is a convenient tool for building the API clients a scraper relies on.
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a crawler with Playwright for JS rendering
const crawler = new PlaywrightCrawler({
  // Limit concurrency to avoid overwhelming the target
  maxConcurrency: 5,

  // This function is called for each URL
  async requestHandler({ request, page, enqueueLinks }) {
    const title = await page.title();

    // Extract data from the page
    const products = await page.$$eval('.product', (els) =>
      els.map((el) => ({
        name: el.querySelector('.name')?.textContent,
        price: el.querySelector('.price')?.textContent,
      }))
    );

    // Store extracted data
    await Dataset.pushData({
      url: request.url,
      title,
      products,
    });

    // Follow links to crawl more pages
    await enqueueLinks({
      globs: ['https://example.com/products/**'],
    });
  },
});

// Start crawling
await crawler.run(['https://example.com/products']);
```
```python
# 0. Import
import asyncio

import httpx

from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel


# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}"  # 👈 Put placeholders as needed


# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/"  # 👈 Optional BASE_URL

        # 👇 Define settings to apply to every request
        SETTINGS = GracyConfig(
            log_request=LogEvent(LogLevel.DEBUG),
            log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
            parser={"default": lambda r: r.json()},
        )

    async def get_pokemon(self, name: str) -> dict:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx, we can customize the underlying
    # client with custom headers etc.
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": "My Scraper"}
        return client


pokeapi = GracefulPokeAPI()


async def main():
    try:
        pokemon = await pokeapi.get_pokemon("pikachu")
        print(pokemon)
    finally:
        pokeapi.report_status("rich")


asyncio.run(main())
```