
Crawlee vs Scrapy

Crawlee: Apache-2.0 175 26 22,720 341.9 thousand (month) Apr 22 2022 3.16.0 (2026-04-09 07:36:53)
Scrapy: 61,276 30 640 BSD-3-Clause Jul 26 2019 3.1 million (month) 2.15.0 (2026-04-09 12:02:09)

Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript, built by Apify. It is the successor to the Apify SDK and provides a unified interface for building reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction projects.

Crawlee supports multiple crawling strategies through different crawler classes:

  • CheerioCrawler: fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
  • PlaywrightCrawler: uses Playwright for full browser automation. Handles JavaScript-rendered pages, SPAs, and complex interactions.
  • PuppeteerCrawler: similar to PlaywrightCrawler, but uses Puppeteer as the browser automation backend.
  • HttpCrawler: minimal crawler for raw HTTP requests without HTML parsing.

Key features include:

  • Automatic request queue management with configurable concurrency and rate limiting
  • Built-in proxy rotation with session management
  • Persistent request queue and dataset storage (local or cloud via Apify)
  • Automatic retry and error handling with configurable strategies
  • TypeScript-first design with full type safety
  • Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
  • Output pipelines for storing extracted data
  • Easy deployment to Apify cloud platform
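
The queue and retry behavior listed above can be illustrated with a small, dependency-free sketch. This is not Crawlee's implementation or API: `runQueue`, its `handler` callback, and the `maxRetries` parameter are hypothetical names chosen for illustration, and the sketch is synchronous, whereas a real crawler processes requests concurrently.

```javascript
// Hypothetical helper, not part of Crawlee: a synchronous request queue
// that deduplicates URLs and retries failed requests up to maxRetries times.
function runQueue(startUrls, handler, maxRetries = 3) {
  const seen = new Set(startUrls);
  const queue = [...seen].map((url) => ({ url, retryCount: 0 }));
  const succeeded = [];
  const failed = [];

  while (queue.length > 0) {
    const request = queue.shift();
    try {
      // The handler may return newly discovered URLs to enqueue.
      const discovered = handler(request) || [];
      succeeded.push(request.url);
      for (const url of discovered) {
        if (!seen.has(url)) {
          seen.add(url);
          queue.push({ url, retryCount: 0 });
        }
      }
    } catch (error) {
      if (request.retryCount < maxRetries) {
        // Put the request back at the end of the queue for another attempt.
        queue.push({ url: request.url, retryCount: request.retryCount + 1 });
      } else {
        failed.push({ url: request.url, error: error.message });
      }
    }
  }
  return { succeeded, failed };
}

// Usage: the handler fails twice, then succeeds and reports one new link.
let attempts = 0;
const result = runQueue(['https://example.com/a'], () => {
  attempts += 1;
  if (attempts < 3) throw new Error('temporary failure');
  return ['https://example.com/b'];
});
console.log(result.succeeded, result.failed);
```

In Crawlee itself these concerns are handled by the crawler classes and their options (for example `maxConcurrency`, as used in the example on this page), so a helper like this would not normally be written by hand.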

Crawlee is often described as the most feature-complete web scraping framework in the JavaScript/TypeScript ecosystem. It is comparable to Python's Scrapy, with the addition of native browser automation support.

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling).
  • Built-in handling of common web scraping tasks such as logging in, managing cookies, and following redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links

Scrapy also provides these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and URL filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.
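
The deduplication and URL filtering mentioned above are language-neutral ideas, so the sketch below shows them in JavaScript to match the example on this page. It is not Scrapy's implementation: `canonicalize` and `createUrlFilter` are hypothetical helpers, and the normalization rules (drop the fragment, sort query parameters) are an assumption about what counts as "the same page".

```javascript
// Hypothetical helper: reduce a URL to a canonical key so that trivially
// different spellings of the same page compare equal.
function canonicalize(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';            // fragments never reach the server
  url.searchParams.sort();  // ?b=2&a=1 and ?a=1&b=2 become identical
  return url.toString();
}

// Hypothetical helper: returns a predicate that accepts a URL only if it is
// on an allowed domain (or a subdomain of one) and has not been seen before.
function createUrlFilter(allowedDomains) {
  const seen = new Set();
  return function shouldCrawl(rawUrl) {
    const host = new URL(rawUrl).hostname;
    const allowed = allowedDomains.some(
      (domain) => host === domain || host.endsWith('.' + domain)
    );
    if (!allowed) return false;

    const key = canonicalize(rawUrl);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

// Usage: only the first spelling of each in-scope URL is accepted.
const shouldCrawl = createUrlFilter(['example.com']);
console.log(shouldCrawl('https://example.com/p?b=2&a=1'));
console.log(shouldCrawl('https://example.com/p?a=1&b=2#reviews'));
console.log(shouldCrawl('https://other.org/'));
```

In Scrapy this work is done by the framework, so spider code normally does not need a filter like this.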

Highlights


Crawlee: popular, typescript, extendible, middlewares, output-pipelines, large-scale, proxy
Scrapy: popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a crawler with Playwright for JS rendering
const crawler = new PlaywrightCrawler({
  // Limit concurrency to avoid overwhelming the target
  maxConcurrency: 5,

  // This function is called for each URL
  async requestHandler({ request, page, enqueueLinks }) {
    const title = await page.title();

    // Extract data from the page
    const products = await page.$$eval('.product', (els) =>
      els.map((el) => ({
        name: el.querySelector('.name')?.textContent,
        price: el.querySelector('.price')?.textContent,
      }))
    );

    // Store extracted data
    await Dataset.pushData({
      url: request.url,
      title,
      products,
    });

    // Follow links to crawl more pages
    await enqueueLinks({
      globs: ['https://example.com/products/**'],
    });
  },
});

// Start crawling
await crawler.run(['https://example.com/products']);
```


