
Firecrawl vs Kimurai


Firecrawl is an AI-powered web scraping API that converts web pages into clean Markdown or structured data, optimized for use with large language models (LLMs) and retrieval-augmented generation (RAG) pipelines. It handles JavaScript rendering, anti-bot bypass, and content extraction automatically.

Firecrawl offers multiple modes:

  • Scrape: convert a single URL into clean Markdown, HTML, or structured data. Handles JavaScript rendering and anti-bot protections automatically.
  • Crawl: crawl an entire website starting from a URL, with configurable depth, URL patterns, and page limits. Returns all pages as clean Markdown.
  • Map: quickly discover all URLs on a website without fully scraping each page. Useful for sitemap generation and crawl planning.
  • Extract: use LLMs to extract specific structured data from pages based on a schema definition.

Key features:

  • Clean Markdown output ideal for LLM context windows
  • Automatic JavaScript rendering with headless browsers
  • Built-in anti-bot bypass for protected websites
  • Structured extraction with JSON schemas
  • Batch crawling with webhook notifications (sketched after this list)
  • Python and JavaScript SDKs
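
The batch crawling and webhook features combine in the asynchronous crawl API. Below is a minimal sketch of starting a crawl that reports results to a webhook; `async_crawl_url`, `check_crawl_status`, and the `webhook` parameter follow Firecrawl's v1 API, and the webhook URL is a placeholder for your own endpoint.

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Start a crawl without blocking; Firecrawl delivers page events to the
# webhook URL (placeholder endpoint) as pages are scraped.
job = app.async_crawl_url(
    "https://example.com",
    params={
        "limit": 100,
        "scrapeOptions": {"formats": ["markdown"]},
        "webhook": "https://your-app.example.com/firecrawl-webhook",
    },
)

# The call returns immediately with a job id that can be polled later
print(job["id"])
status = app.check_crawl_status(job["id"])
print(status["status"])  # e.g. "scraping" or "completed"
```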

Firecrawl is a commercial API service (requires API key, has a free tier) backed by Y Combinator. It has become one of the most popular tools for feeding web content into AI applications and is widely used in the LLM/RAG ecosystem.

Note: while the primary service is an API, the core is open source and can be self-hosted.
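
Since the core is open source, the same SDK can target a self-hosted deployment. This is a minimal sketch assuming the Python SDK's `api_url` parameter and the default self-host port (3002); adjust both for your setup.

```python
from firecrawl import FirecrawlApp

# Point the SDK at a self-hosted Firecrawl instance instead of the
# hosted API; localhost:3002 is the default port in the self-host setup.
app = FirecrawlApp(api_key="any-local-key", api_url="http://localhost:3002")

result = app.scrape_url("https://example.com")
print(result["markdown"])
```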

Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides a structured approach to building web scrapers with built-in support for multiple browser engines, session management, and data pipelines.

Key features include:

  • Multiple engine support: can use different backends depending on the scraping needs, from Mechanize for simple HTTP requests to Selenium with headless Chrome/Firefox for JavaScript-rendered pages and Poltergeist (PhantomJS) for lightweight rendering.
  • Scrapy-like architecture: follows the spider pattern. Define a spider class with start URLs and parsing methods, and the framework handles crawling, scheduling, and data collection.
  • Built-in data pipelines: save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
  • Session management: maintains browser sessions with automatic cookie handling and configurable delays between requests (see the configuration sketch after this list).
  • Request scheduling: built-in request queue with configurable concurrency, delays, and retry logic.
  • CLI tools: command-line tools for generating new spiders, running individual spiders, and managing scraping projects.
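
Session management and request scheduling are configured per spider through the `@config` hash. The sketch below uses options from Kimurai's documented configuration (`user_agent`, `before_request` with a `delay` range, `retry_request_errors`); treat it as illustrative rather than exhaustive.

```ruby
require 'kimurai'

class ConfiguredSpider < Kimurai::Base
  @name = 'configured_spider'
  @engine = :selenium_chrome
  @start_urls = ['https://example.com']

  # Per-spider session and scheduling options (Kimurai @config conventions)
  @config = {
    user_agent: 'Mozilla/5.0 (compatible; MyScraper/1.0)',
    before_request: {
      clear_cookies: true,  # reset session state before each request
      delay: 2..5           # random delay in seconds between requests
    },
    retry_request_errors: [Net::ReadTimeout]  # retry requests on these errors
  }

  def parse(response, url:, data: {})
    puts response.title
  end
end
```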

Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured scraping projects that need organization, multiple spiders, and data pipeline processing.

Note: Kimurai has not seen active development recently, but it remains a useful framework for Ruby scraping projects and is included here as the most complete Ruby scraping framework available.

Highlights


Firecrawl: ai-powered, popular, async
Kimurai: middlewares, output-pipelines

Example Use


```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page - get clean markdown
result = app.scrape_url("https://example.com/blog/article")
print(result["markdown"])  # clean markdown content

# Extract structured data with a schema
result = app.scrape_url(
    "https://example.com/product/123",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                },
            }
        },
    },
)
print(result["extract"])  # {"name": "...", "price": 29.99, ...}

# Crawl an entire website
crawl_result = app.crawl_url(
    "https://example.com",
    params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}},
)
for page in crawl_result["data"]:
    print(page["metadata"]["title"], page["markdown"][:100])

# Map all URLs on a site
map_result = app.map_url("https://example.com")
print(f"Found {len(map_result['links'])} URLs")
```
```ruby
require 'kimurai'

class ProductSpider < Kimurai::Base
  @name = 'product_spider'
  @engine = :selenium_chrome  # or :mechanize for simple pages
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    # Extract product data from the current page
    response.css('.product').each do |product|
      item = {
        name: product.css('.name').text.strip,
        price: product.css('.price').text.strip,
        url: absolute_url(product.at_css('a')['href'], base: url),
      }
      # Send item to the pipeline
      save_to "products.json", item, format: :json
    end

    # Follow pagination links
    if next_page = response.at_css('a.next-page')
      request_to :parse, url: absolute_url(next_page['href'], base: url)
    end
  end
end

# Run the spider
ProductSpider.crawl!
```


