Skip to content

node-crawlervsfirecrawl

MIT 30 6 6,790
15.3 thousand (month) Sep 10 2012 2.0.2(2025-05-28 09:36:01 ago)
- - - None
Apr 01 2024 0.0.0(2025-03-15 00:00:00 ago)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.

Features:

  • Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
  • Configurable pool size and retries,
  • Control rate limit,
  • Priority queue of requests,
  • forceUTF8 mode to let crawler deal for you with charset detection and conversion,
  • Compatible with 4.x or newer version.
  • Http2 support
  • Proxy support

Firecrawl is an AI-powered web scraping API that converts web pages into clean Markdown or structured data, optimized for use with large language models (LLMs) and retrieval-augmented generation (RAG) pipelines. It handles JavaScript rendering, anti-bot bypass, and content extraction automatically.

Firecrawl offers multiple modes:

  • Scrape Convert a single URL into clean Markdown, HTML, or structured data. Handles JavaScript rendering and anti-bot protections automatically.
  • Crawl Crawl an entire website starting from a URL, with configurable depth, URL patterns, and page limits. Returns all pages as clean Markdown.
  • Map Quickly discover all URLs on a website without fully scraping each page. Useful for sitemap generation and crawl planning.
  • Extract Use LLMs to extract specific structured data from pages based on a schema definition.

Key features:

  • Clean Markdown output ideal for LLM context windows
  • Automatic JavaScript rendering with headless browsers
  • Built-in anti-bot bypass for protected websites
  • Structured extraction with JSON schemas
  • Batch crawling with webhook notifications
  • Python and JavaScript SDKs

Firecrawl is a commercial API service (requires API key, has a free tier) backed by Y Combinator. It has become one of the most popular tools for feeding web content into AI applications and is widely used in the LLM/RAG ecosystem.

Note: while the primary service is an API, the core is open source and can be self-hosted.

Highlights


ai-poweredpopularasync

Example Use


```javascript const Crawler = require('crawler'); const c = new Crawler({ maxConnections: 10, // This will be called for each crawled page callback: (error, res, done) => { if (error) { console.log(error); } else { const $ = res.$; // $ is Cheerio by default //a lean implementation of core jQuery designed specifically for the server console.log($('title').text()); } done(); } }); // Queue just one URL, with default callback c.queue('http://www.amazon.com'); // Queue a list of URLs c.queue(['http://www.google.com/','http://www.yahoo.com']); // Queue URLs with custom callbacks & parameters c.queue([{ uri: 'http://parishackers.org/', jQuery: false, // The global callback won't be called callback: (error, res, done) => { if (error) { console.log(error); } else { console.log('Grabbed', res.body.length, 'bytes'); } done(); } }]); // Queue some HTML code directly without grabbing (mostly for tests) c.queue([{ html: '

This is a test

' }]); ```
```python from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="YOUR_API_KEY") # Scrape a single page - get clean markdown result = app.scrape_url("https://example.com/blog/article") print(result["markdown"]) # clean markdown content # Extract structured data with a schema result = app.scrape_url( "https://example.com/product/123", params={ "formats": ["extract"], "extract": { "schema": { "type": "object", "properties": { "name": {"type": "string"}, "price": {"type": "number"}, "description": {"type": "string"}, }, } }, }, ) print(result["extract"]) # {"name": "...", "price": 29.99, ...} # Crawl an entire website crawl_result = app.crawl_url( "https://example.com", params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}}, ) for page in crawl_result["data"]: print(page["metadata"]["title"], page["markdown"][:100]) # Map all URLs on a site map_result = app.map_url("https://example.com") print(f"Found {len(map_result['links'])} URLs") ```

Alternatives / Similar


Was this page helpful?