ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web
scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions,
you describe the data you want in natural language and provide a Pydantic schema, and the
library handles the rest.
Key features include:
- Natural language extraction
Describe what you want to extract in plain English (e.g., "Extract all product names
and prices") and the LLM figures out how to find and extract the data.
- Pydantic schema output
Define the expected output structure using Pydantic models for type-safe, validated
extraction results.
- Graph-based pipeline
Built on a directed graph architecture where each node performs a specific task
(fetching, parsing, extracting, merging). This makes pipelines modular and debuggable.
- Multiple graph types
SmartScraperGraph (single page), SearchGraph (search + scrape), SpeechGraph (audio output),
and more specialized pipelines.
- Multiple LLM providers
Works with OpenAI, Anthropic, Google, Groq, local models via Ollama, and more.
- HTML and JSON support
Can extract data from both HTML pages and JSON API responses.
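
The graph-based design described in the list above can be illustrated with a minimal, standard-library sketch. This is not ScrapeGraphAI's internal implementation: the node functions (`fetch_node`, `parse_node`, `extract_node`), the `run_pipeline` helper, and the inline HTML are hypothetical stand-ins, and a plain string split takes the place of the LLM extraction step. It only shows how small single-purpose nodes connected by edges pass a shared state along.

```python
import re
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for an HTTP fetch: a real pipeline would download state["source"].
    state["html"] = "<ul><li>Widget - $9.99</li><li>Gadget - $19.50</li></ul>"
    return state


def parse_node(state: State) -> State:
    # Split the document into small chunks (here: the text of each <li>).
    state["chunks"] = re.findall(r"<li>(.*?)</li>", state["html"])
    return state


def extract_node(state: State) -> State:
    # Stand-in for the LLM call: turn each chunk into a structured record.
    products = []
    for chunk in state["chunks"]:
        name, price = chunk.split(" - $")
        products.append({"name": name, "price": float(price)})
    state["products"] = products
    return state


def run_pipeline(
    nodes: Dict[str, Node], edges: Dict[str, str], entry: str, state: State
) -> State:
    # Follow the edges from the entry node until a node has no successor.
    current: Optional[str] = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)
    return state


nodes = {"fetch": fetch_node, "parse": parse_node, "extract": extract_node}
edges = {"fetch": "parse", "parse": "extract"}
result = run_pipeline(nodes, edges, "fetch", {"source": "https://example.com/products"})
print(result["products"])
```

Because each node only reads and writes the shared state, a single step can be swapped out or inspected in isolation, which is the modularity the feature list refers to.
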
ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting
data from pages with complex or frequently changing layouts where traditional selectors
would be brittle.
Ruia is an async web scraping micro-framework that aims to make crawling URLs as
convenient as possible.
It is inspired by Scrapy, but instead of Twisted it is built entirely on asyncio and aiohttp.
It also supports cookies, custom headers, and proxies, which helps with more complex web scraping tasks.
```python
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List


# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
```python
#!/usr/bin/env python
"""
Target: https://news.ycombinator.com/
pip install aiofiles
"""
import aiofiles
from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    # The selectors below depend on the site's markup; adjust them if the
    # page structure has changed since this example was written.
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        # Normalize the extracted title by trimming surrounding whitespace
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        # Yield one item per matched row on the page
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        # Append each title to a local text file
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
```
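
The `concurrency = 10` setting in the spider above caps how many requests are in flight at once. One common way to get that behavior with asyncio is a semaphore, sketched below using only the standard library. This illustrates the general technique and is an assumption, not a description of Ruia's actual internals; `fake_fetch`, `crawl`, and the simulated delay are hypothetical and replace real aiohttp requests so the snippet runs offline.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_fetch(url: str, semaphore: asyncio.Semaphore, stats: Dict[str, int]) -> str:
    # Only `concurrency` coroutines may be inside this block at the same time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency (hypothetical)
        stats["active"] -= 1
        return f"<html>{url}</html>"


async def crawl(urls: List[str], concurrency: int) -> Tuple[List[str], int]:
    semaphore = asyncio.Semaphore(concurrency)
    stats = {"active": 0, "peak": 0}
    # gather() schedules every fetch at once; the semaphore throttles them.
    pages = await asyncio.gather(*(fake_fetch(u, semaphore, stats) for u in urls))
    return list(pages), stats["peak"]


urls = [f"https://news.ycombinator.com/news?p={n}" for n in range(1, 9)]
pages, peak = asyncio.run(crawl(urls, concurrency=3))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

All eight coroutines are scheduled immediately, but the recorded peak never exceeds the configured limit, which keeps the crawler from overwhelming the target site.
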