Ruia is an async web scraping micro-framework, written with asyncio and aiohttp,
that aims to make crawling URLs as convenient as possible.
Ruia is inspired by Scrapy, but instead of Twisted it is built entirely on asyncio and aiohttp.
It also supports cookies, custom headers, and proxies, which makes it well suited to complex web scraping tasks.
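For instance, headers and a proxy can be configured through the spider's class attributes. A minimal sketch, assuming the `headers` attribute and `aiohttp_kwargs` hook that appear in Ruia's own examples (the proxy URL is a placeholder):

```python
from ruia import Spider


class ConfiguredSpider(Spider):
    start_urls = ["https://httpbin.org/get"]

    # Sent with every request this spider makes.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"}

    # Extra keyword arguments forwarded to aiohttp; the proxy URL
    # below is a placeholder, not a real endpoint.
    aiohttp_kwargs = {"proxy": "http://127.0.0.1:8080"}

    async def parse(self, response):
        self.logger.info(await response.text())


if __name__ == "__main__":
    ConfiguredSpider.start()
```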
Firecrawl is an AI-powered web scraping API that converts web pages into clean Markdown or
structured data, optimized for use with large language models (LLMs) and retrieval-augmented
generation (RAG) pipelines. It handles JavaScript rendering, anti-bot bypass, and content
extraction automatically.
Firecrawl offers multiple modes:
- Scrape
Convert a single URL into clean Markdown, HTML, or structured data. Handles JavaScript
rendering and anti-bot protections automatically.
- Crawl
Crawl an entire website starting from a URL, with configurable depth, URL patterns,
and page limits. Returns all pages as clean Markdown.
- Map
Quickly discover all URLs on a website without fully scraping each page. Useful for
sitemap generation and crawl planning.
- Extract
Use LLMs to extract specific structured data from pages based on a schema definition.
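Under the hood, these modes correspond to REST endpoints that the SDKs wrap. A minimal sketch of calling the scrape endpoint directly with `requests`; the v1 path and the nested `data` response shape follow the hosted service's documented API, but treat them as assumptions to verify against the current docs:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Scrape mode via the raw REST API (the SDKs wrap endpoints like this).
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://example.com/blog/article", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()
payload = resp.json()

# Assumption: the v1 API nests results under "data".
print(payload["data"]["markdown"][:200])
```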
Key features:
- Clean Markdown output ideal for LLM context windows
- Automatic JavaScript rendering with headless browsers
- Built-in anti-bot bypass for protected websites
- Structured extraction with JSON schemas
- Batch crawling with webhook notifications (see the receiver sketch after this list)
- Python and JavaScript SDKs
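For the webhook flow, a crawl job POSTs progress events to a URL you host. Below is a minimal receiver sketch that just logs whatever JSON arrives; the endpoint path is hypothetical, and no specific payload schema is assumed:

```python
from flask import Flask, request

app = Flask(__name__)


# Hypothetical endpoint you would register as the crawl's webhook URL.
@app.post("/firecrawl-webhook")
def firecrawl_webhook():
    # Log the raw event rather than assuming a payload schema.
    event = request.get_json(force=True)
    app.logger.info("firecrawl event: %s", event)
    return "", 204


if __name__ == "__main__":
    app.run(port=8000)
```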
Firecrawl is a commercial API service (an API key is required; a free tier is available) from a
Y Combinator-backed company. It has become one of the most popular tools for feeding web content
into AI applications and is widely used in the LLM/RAG ecosystem.
Note: while the primary service is an API, the core is open source and can be self-hosted.
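When self-hosting, the Python SDK can be pointed at your own deployment instead of the hosted API. A minimal sketch, assuming the SDK exposes an `api_url` argument (the address below is a placeholder for wherever your instance runs):

```python
from firecrawl import FirecrawlApp

# Target a self-hosted Firecrawl deployment rather than api.firecrawl.dev.
# The api_url below is a placeholder, not a real endpoint.
app = FirecrawlApp(api_key="self-hosted-key", api_url="http://localhost:3002")

result = app.scrape_url("https://example.com")
print(result)  # response shape depends on the SDK version
```

The examples below show both tools in action. First, a complete Ruia spider that crawls the first two pages of Hacker News: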
```python
#!/usr/bin/env python
"""
Target: https://news.ycombinator.com/
pip install ruia aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    # Each <tr class="athing"> row is one story.
    target_item = TextField(css_select="tr.athing")
    # Note: HN's markup changes over time; if a.storylink no longer
    # matches, the current equivalent is likely "span.titleline > a".
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        # clean_* hooks post-process the matching field's value.
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
```
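In the example above, `HackerNewsSpider.start(middleware=None)` leaves the hook system unused. Here is a minimal sketch of a Ruia middleware that stamps a User-Agent header on every outgoing request, assuming the `Middleware` request-decorator API from Ruia's docs:

```python
from ruia import Middleware

middleware = Middleware()


# Runs before every request the spider sends.
@middleware.request
async def add_user_agent(spider_ins, request):
    ua = "Mozilla/5.0 (compatible; MyCrawler/1.0)"
    if request.headers:
        request.headers.update({"User-Agent": ua})
    else:
        request.headers = {"User-Agent": ua}


# Then pass it in instead of None:
# HackerNewsSpider.start(middleware=middleware)
```

And the same territory through the hosted service, using Firecrawl's Python SDK: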
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page - get clean markdown.
# (The dict-style responses below match the v0 SDK; newer releases
# may return response objects with attribute access instead.)
result = app.scrape_url("https://example.com/blog/article")
print(result["markdown"])  # clean markdown content

# Extract structured data with a JSON schema.
result = app.scrape_url(
    "https://example.com/product/123",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                },
            }
        },
    },
)
print(result["extract"])  # {"name": "...", "price": 29.99, ...}

# Crawl an entire website.
crawl_result = app.crawl_url(
    "https://example.com",
    params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}},
)
for page in crawl_result["data"]:
    print(page["metadata"]["title"], page["markdown"][:100])

# Map all URLs on a site without scraping each one.
map_result = app.map_url("https://example.com")
print(f"Found {len(map_result['links'])} URLs")
```