
scrapegraphai vs ralger

scrapegraphai: MIT 4 17 23,278 · 59.6 thousand (month) · Jan 15 2024 · 1.76.0 (2026-04-09 09:41:03)
ralger: 165 1 3 MIT · Dec 22 2019 · 327 (month) · 2.3.0 (2021-03-18 00:10:00)

ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions, you describe what data you want in natural language and provide a Pydantic schema — the library handles the rest.

Key features include:

  • Natural language extraction: Describe what you want to extract in plain English (e.g., "Extract all product names and prices") and the LLM figures out how to find and extract the data.
  • Pydantic schema output: Define the expected output structure using Pydantic models for type-safe, validated extraction results.
  • Graph-based pipeline: Built on a directed graph architecture where each node performs a specific task (fetching, parsing, extracting, merging). This makes pipelines modular and debuggable.
  • Multiple graph types: SmartScraperGraph (single page), SearchGraph (search + scrape), SpeechGraph (audio output), and more specialized pipelines.
  • Multiple LLM providers: Works with OpenAI, Anthropic, Google, Groq, local models via Ollama, and more.
  • HTML and JSON support: Can extract data from both HTML pages and JSON API responses.
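
To make the graph-based pipeline idea concrete, here is a minimal sketch in plain Python. It is not ScrapeGraphAI's internal implementation: the `Node` and `Pipeline` classes and the `fetch`, `parse`, `extract` and `merge` functions are hypothetical names invented for illustration, the "page" is an in-memory string rather than a real HTTP response, and a simple string split stands in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Node:
    """One pipeline step: reads the shared state and returns new keys for it."""
    name: str
    run: Callable[[State], State]


@dataclass
class Pipeline:
    """A linear directed graph in which each node feeds the next one."""
    nodes: List[Node]
    trace: List[str] = field(default_factory=list)

    def execute(self, state: State) -> State:
        for node in self.nodes:
            # Merge the node's output into the shared state.
            state = {**state, **node.run(state)}
            # Record the execution order so a run can be inspected afterwards.
            self.trace.append(node.name)
        return state


def fetch(state: State) -> State:
    # Stand-in for an HTTP request: the "page" is passed in directly.
    return {"html": state["source"]}


def parse(state: State) -> State:
    # Split the document into non-empty, stripped lines.
    lines = [ln.strip() for ln in str(state["html"]).splitlines() if ln.strip()]
    return {"chunks": lines}


def extract(state: State) -> State:
    # Stand-in for the LLM call: read "name: price" pairs from each chunk.
    items = []
    for chunk in state["chunks"]:
        name, _, price = chunk.partition(":")
        items.append({"name": name.strip(), "price": float(price)})
    return {"items": items}


def merge(state: State) -> State:
    # Combine the per-chunk results into one output structure.
    return {"result": {"products": state["items"]}}


pipeline = Pipeline(
    [Node("fetch", fetch), Node("parse", parse),
     Node("extract", extract), Node("merge", merge)]
)
final = pipeline.execute({"source": "Widget: 9.99\nGadget: 24.50\n"})
print(final["result"])
print(pipeline.trace)
```

Because each step only reads from and writes to the shared state, a node can be swapped or tested in isolation, which is the modularity and debuggability benefit described in the list above.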

ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting data from pages with complex or frequently changing layouts where traditional selectors would be brittle.

ralger is a small web scraping framework for R based on rvest and xml2.

Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.

It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables automatically, and automatically extracting links, titles, images and paragraphs.

Highlights


ai-powered, popular

Example Use


```python
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List


# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
```r
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA

# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```

