
crawlee vs scrapegraphai

crawlee: Apache-2.0 · 175 · 26 · 22,720 · 341.9 thousand (month) · Apr 22 2022 · 3.16.0 (2026-04-09 07:36:53)
scrapegraphai: 23,278 · 17 · 4 · MIT · Jan 15 2024 · 59.6 thousand (month) · 1.76.0 (2026-04-09 09:41:03)

Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript, built by Apify. It is the successor to the Apify SDK and provides a unified interface for building reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction projects.

Crawlee supports multiple crawling strategies through different crawler classes:

  • CheerioCrawler: Fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
  • PlaywrightCrawler: Uses Playwright for full browser automation. Handles JavaScript-rendered pages, SPAs, and complex interactions.
  • PuppeteerCrawler: Similar to PlaywrightCrawler but uses Puppeteer as the browser automation backend.
  • HttpCrawler: Minimal crawler for raw HTTP requests without HTML parsing.

Key features include:

  • Automatic request queue management with configurable concurrency and rate limiting
  • Built-in proxy rotation with session management
  • Persistent request queue and dataset storage (local or cloud via Apify)
  • Automatic retry and error handling with configurable strategies
  • TypeScript-first design with full type safety
  • Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
  • Output pipelines for storing extracted data
  • Easy deployment to Apify cloud platform
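
The queueing and retry behaviour in the list above can be illustrated with a small conceptual sketch. It is written in plain Python rather than with Crawlee's own API, and every name in it (`crawl`, `handler`, `max_retries`) is hypothetical. It shows only the general idea of a deduplicating request queue that re-enqueues failed requests a limited number of times; concurrency, rate limiting, and persistence are left out.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set


def crawl(
    start_urls: List[str],
    handler: Callable[[str], List[str]],
    max_retries: int = 2,
) -> Dict[str, List[str]]:
    """Process URLs from a queue, skipping duplicates and retrying failures.

    `handler` is a hypothetical callback: it receives a URL and returns the
    links found on that page, or raises an exception if the request fails.
    """
    queue: Deque[str] = deque()
    seen: Set[str] = set()
    attempts: Dict[str, int] = {}
    done: List[str] = []
    failed: List[str] = []

    def enqueue(url: str) -> None:
        # Deduplicate: a URL is added to the queue at most once.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:
        enqueue(url)

    while queue:
        url = queue.popleft()
        attempts[url] = attempts.get(url, 0) + 1
        try:
            links = handler(url)
        except Exception:
            if attempts[url] <= max_retries:
                queue.append(url)  # put it back for another attempt
            else:
                failed.append(url)  # give up after the retry budget is spent
            continue
        done.append(url)
        for link in links:
            enqueue(link)

    return {"done": done, "failed": failed}
```

With `max_retries=2`, a URL whose handler always raises is attempted three times in total before it is reported under `failed`.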

Crawlee is among the most feature-complete web scraping frameworks in the JavaScript/TypeScript ecosystem. It is comparable to Python's Scrapy, with the addition of native browser automation support.

ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions, you describe the data you want in natural language and provide a Pydantic schema, and the library handles the rest.

Key features include:

  • Natural language extraction: Describe what you want to extract in plain English (e.g., "Extract all product names and prices") and the LLM figures out how to find and extract the data.
  • Pydantic schema output: Define the expected output structure using Pydantic models for type-safe, validated extraction results.
  • Graph-based pipeline: Built on a directed graph architecture where each node performs a specific task (fetching, parsing, extracting, merging). This makes pipelines modular and debuggable.
  • Multiple graph types: SmartScraperGraph (single page), SearchGraph (search + scrape), SpeechGraph (audio output), and more specialized pipelines.
  • Multiple LLM providers: Works with OpenAI, Anthropic, Google, Groq, local models via Ollama, and more.
  • HTML and JSON support: Can extract data from both HTML pages and JSON API responses.
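
The graph-based design described in the list above can be pictured with a minimal sketch in plain Python. This is not ScrapeGraphAI's actual node API: the function names, the shared `state` dictionary, and the hard-coded HTML are assumptions made for illustration, and the "extract" step uses simple string splitting where the real library would call an LLM. The graph here is the simplest directed graph possible, a linear chain of four nodes.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for fetching: a real pipeline would download state["source"].
    return {**state, "html": "<li>Pen - 2.50</li><li>Ink - 7.00</li>"}


def parse_node(state: State) -> State:
    # Split the document into chunks, here one chunk per list item.
    chunks = [c.replace("</li>", "") for c in state["html"].split("<li>") if c]
    return {**state, "chunks": chunks}


def extract_node(state: State) -> State:
    # Stand-in for the LLM call: turn each chunk into a structured record.
    records = []
    for chunk in state["chunks"]:
        name, price = chunk.split(" - ")
        records.append({"name": name, "price": float(price)})
    return {**state, "records": records}


def merge_node(state: State) -> State:
    # Combine the per-chunk records into a single answer.
    return {**state, "answer": {"products": state["records"]}}


def run_pipeline(nodes: List[Node], state: State) -> State:
    # Each node reads the shared state and returns an updated copy.
    for node in nodes:
        state = node(state)
    return state


final = run_pipeline(
    [fetch_node, parse_node, extract_node, merge_node],
    {"source": "https://example.com/products"},
)
print(final["answer"])
```

Because every node returns a new state, the intermediate keys (`html`, `chunks`, `records`) remain available for inspection after the run, which is one way such a pipeline can be debugged step by step.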

ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting data from pages with complex or frequently changing layouts where traditional selectors would be brittle.

Highlights


crawlee: popular, typescript, extendible, middlewares, output-pipelines, large-scale, proxy
scrapegraphai: ai-powered, popular

Example Use


```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a crawler with Playwright for JS rendering
const crawler = new PlaywrightCrawler({
    // Limit concurrency to avoid overwhelming the target
    maxConcurrency: 5,

    // This function is called for each URL
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();

        // Extract data from the page
        const products = await page.$$eval('.product', (els) =>
            els.map((el) => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }))
        );

        // Store extracted data
        await Dataset.pushData({
            url: request.url,
            title,
            products,
        });

        // Follow links to crawl more pages
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
        });
    },
});

// Start crawling
await crawler.run(['https://example.com/products']);
```
```python
from typing import List

from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperGraph


# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
