ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web
scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions,
you describe what data you want in natural language and provide a Pydantic schema — the
library handles the rest.
Key features include:
- Natural language extraction: describe what you want to extract in plain English (e.g.,
  "Extract all product names and prices") and the LLM works out how to find and extract
  the data.
- Pydantic schema output: define the expected output structure using Pydantic models for
  type-safe, validated extraction results.
- Graph-based pipeline: built on a directed graph architecture where each node performs a
  specific task (fetching, parsing, extracting, merging). This makes pipelines modular and
  debuggable.
- Multiple graph types: SmartScraperGraph (single page), SearchGraph (search + scrape),
  SpeechGraph (audio output), and more specialized pipelines.
- Multiple LLM providers: works with OpenAI, Anthropic, Google, Groq, local models via
  Ollama, and more.
- HTML and JSON support: can extract data from both HTML pages and JSON API responses.
ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting
data from pages with complex or frequently changing layouts where traditional selectors
would be brittle.
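To make the graph-based pipeline idea concrete, here is a minimal, library-free sketch of a linear directed graph whose nodes fetch, parse, extract, and merge. It does not use ScrapeGraphAI's actual classes: `Node`, `Graph`, the four step functions, and the hard-coded HTML are all hypothetical stand-ins, and the "extract" step uses a plain string split where the real library would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Node:
    """One pipeline step: reads the shared state and returns an updated copy."""
    name: str
    run: Callable[[State], State]


@dataclass
class Graph:
    """A linear directed graph: each node has a single edge to the next one."""
    nodes: List[Node]
    trace: List[str] = field(default_factory=list)

    def execute(self, state: State) -> State:
        for node in self.nodes:
            state = node.run(state)
            self.trace.append(node.name)  # record execution order for debugging
        return state


def fetch(state: State) -> State:
    # Stand-in for an HTTP request: the page is hard-coded so the sketch runs offline.
    return {**state, "html": "<li>Widget - 9.99</li><li>Gadget - 19.50</li>"}


def parse(state: State) -> State:
    # Split the page into one text chunk per list item.
    chunks = [part.split("</li>")[0] for part in state["html"].split("<li>")[1:]]
    return {**state, "chunks": chunks}


def extract(state: State) -> State:
    # Stand-in for the LLM call: a string split pulls out the name and price.
    records = []
    for chunk in state["chunks"]:
        name, price = chunk.split(" - ")
        records.append({"name": name, "price": float(price)})
    return {**state, "records": records}


def merge(state: State) -> State:
    # Combine the per-chunk records into one result object.
    return {**state, "result": {"products": state["records"]}}


graph = Graph([
    Node("fetch", fetch),
    Node("parse", parse),
    Node("extract", extract),
    Node("merge", merge),
])
final = graph.execute({"source": "https://example.com/products"})
print(final["result"])
print(graph.trace)
```

Because each step only reads and returns the shared state, a node can be swapped or inspected in isolation, which is the modularity and debuggability the feature list refers to.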
ralger is a small web scraping framework for R based on rvest and xml2.
Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables
automatically, and automatically extracting links, titles, images, and paragraphs.
```python
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List

# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with a natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph and print each extracted product
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
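Since the library also works with local models via Ollama, the `config` argument above could be swapped for something like the following. This is a sketch under assumptions rather than a verified configuration: the model name `llama3` is a placeholder, and the `base_url` key is assumed to be how the Ollama server address is passed. Check the ScrapeGraphAI documentation for the version you have installed.

```python
# Assumed configuration for a local model served by Ollama (no API key needed).
local_config = {
    "llm": {
        # "<provider>/<model>" naming, mirroring "openai/gpt-4o" above;
        # "llama3" is a placeholder for whichever model you have pulled.
        "model": "ollama/llama3",
        # Ollama's default local address; the key name is an assumption.
        "base_url": "http://localhost:11434",
    },
}
```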
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```