Skip to content

ruiavsfirecrawl

Apache-2.0 9 3 1,743
414 (month) Oct 17 2018 0.8.5(2022-09-06 08:54:56 ago)
- - - None
Apr 01 2024 0.0.0(2025-03-15 00:00:00 ago)

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, aims to make crawling url as convenient as possible.

Ruia is inspired by scrapy however instead of Twisted it's based entirely on asyncio and aiohttp.

It also supports various features like cookies, headers, and proxy, which makes it very useful in dealing with complex web scraping tasks.

Firecrawl is an AI-powered web scraping API that converts web pages into clean Markdown or structured data, optimized for use with large language models (LLMs) and retrieval-augmented generation (RAG) pipelines. It handles JavaScript rendering, anti-bot bypass, and content extraction automatically.

Firecrawl offers multiple modes:

  • Scrape Convert a single URL into clean Markdown, HTML, or structured data. Handles JavaScript rendering and anti-bot protections automatically.
  • Crawl Crawl an entire website starting from a URL, with configurable depth, URL patterns, and page limits. Returns all pages as clean Markdown.
  • Map Quickly discover all URLs on a website without fully scraping each page. Useful for sitemap generation and crawl planning.
  • Extract Use LLMs to extract specific structured data from pages based on a schema definition.

Key features:

  • Clean Markdown output ideal for LLM context windows
  • Automatic JavaScript rendering with headless browsers
  • Built-in anti-bot bypass for protected websites
  • Structured extraction with JSON schemas
  • Batch crawling with webhook notifications
  • Python and JavaScript SDKs

Firecrawl is a commercial API service (requires API key, has a free tier) backed by Y Combinator. It has become one of the most popular tools for feeding web content into AI applications and is widely used in the LLM/RAG ecosystem.

Note: while the primary service is an API, the core is open source and can be self-hosted.

Highlights


ai-poweredpopularasync

Example Use


```python #!/usr/bin/env python """ Target: https://news.ycombinator.com/ pip install aiofiles """ import aiofiles from ruia import AttrField, Item, Spider, TextField class HackerNewsItem(Item): target_item = TextField(css_select="tr.athing") title = TextField(css_select="a.storylink") url = AttrField(css_select="a.storylink", attr="href") async def clean_title(self, value): return value.strip() class HackerNewsSpider(Spider): start_urls = [ "https://news.ycombinator.com/news?p=1", "https://news.ycombinator.com/news?p=2", ] concurrency = 10 # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"} async def parse(self, response): async for item in HackerNewsItem.get_items(html=await response.text()): yield item async def process_item(self, item: HackerNewsItem): async with aiofiles.open("./hacker_news.txt", "a") as f: self.logger.info(item) await f.write(str(item.title) + "\n") if __name__ == "__main__": HackerNewsSpider.start(middleware=None) ```
```python from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="YOUR_API_KEY") # Scrape a single page - get clean markdown result = app.scrape_url("https://example.com/blog/article") print(result["markdown"]) # clean markdown content # Extract structured data with a schema result = app.scrape_url( "https://example.com/product/123", params={ "formats": ["extract"], "extract": { "schema": { "type": "object", "properties": { "name": {"type": "string"}, "price": {"type": "number"}, "description": {"type": "string"}, }, } }, }, ) print(result["extract"]) # {"name": "...", "price": 29.99, ...} # Crawl an entire website crawl_result = app.crawl_url( "https://example.com", params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}}, ) for page in crawl_result["data"]: print(page["metadata"]["title"], page["markdown"][:100]) # Map all URLs on a site map_result = app.map_url("https://example.com") print(f"Found {len(map_result['links'])} URLs") ```

Alternatives / Similar


Was this page helpful?