scrapegraphai vs wombat

scrapegraphai: MIT · 4 · 17 · 23,278 · 59.6 thousand (month) · Jan 15 2024 · 1.76.0 (2026-04-09 09:41:03)
wombat: 1,360 · 2 · 24 · MIT · Dec 27 2011 · 1.4 thousand (month) · 3.3.0 (2026-04-07 16:31:34)

ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions, you describe what data you want in natural language and provide a Pydantic schema — the library handles the rest.

Key features include:

  • Natural language extraction: Describe what you want to extract in plain English (e.g., "Extract all product names and prices") and the LLM figures out how to find and extract the data.
  • Pydantic schema output: Define the expected output structure using Pydantic models for type-safe, validated extraction results.
  • Graph-based pipeline: Built on a directed graph architecture where each node performs a specific task (fetching, parsing, extracting, merging). This makes pipelines modular and debuggable.
  • Multiple graph types: SmartScraperGraph (single page), SearchGraph (search + scrape), SpeechGraph (audio output), and more specialized pipelines.
  • Multiple LLM providers: Works with OpenAI, Anthropic, Google, Groq, local models via Ollama, and more.
  • HTML and JSON support: Can extract data from both HTML pages and JSON API responses.
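
The graph-based pipeline idea can be illustrated with a minimal sketch in plain Python. This is not ScrapeGraphAI's internal code: the node functions, the `NODES` and `EDGES` tables, and `run_graph` are hypothetical names chosen for illustration, the "fetch" step returns hard-coded HTML so the sketch runs offline, and the "extract" step uses a string split where the real library would call an LLM.

```python
import re

# Each node takes the shared state dict and returns an updated copy.
# All names below are hypothetical; none of this is ScrapeGraphAI's API.


def fetch(state):
    # Stand-in for an HTTP request: hard-coded HTML keeps the sketch offline.
    html = "<ul><li>Widget - $9.99</li><li>Gadget - $19.50</li></ul>"
    return {**state, "html": html}


def parse(state):
    # Break the document into one text chunk per list item.
    chunks = re.findall(r"<li>(.*?)</li>", state["html"])
    return {**state, "chunks": chunks}


def extract(state):
    # Stand-in for the LLM extraction step: a plain string split.
    products = []
    for chunk in state["chunks"]:
        name, price = chunk.split(" - $")
        products.append({"name": name, "price": float(price)})
    return {**state, "products": products}


NODES = {"fetch": fetch, "parse": parse, "extract": extract}
# Directed edges: each node names its successor; "extract" has none.
EDGES = {"fetch": "parse", "parse": "extract"}


def run_graph(entry, state):
    """Walk the graph from the entry node, recording which nodes ran."""
    trace = []
    current = entry
    while current is not None:
        state = NODES[current](state)
        trace.append(current)
        current = EDGES.get(current)
    return state, trace


final_state, trace = run_graph("fetch", {})
print(trace)
print(final_state["products"])
```

Because every node reads and writes the same state dict, a single step can be swapped out or inspected in isolation, which is the modularity and debuggability the list above refers to.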

ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting data from pages with complex or frequently changing layouts where traditional selectors would be brittle.

Wombat is a Ruby gem that makes it easy to scrape websites and extract structured data from HTML pages. It is built on top of Nokogiri, a popular Ruby gem for parsing and searching HTML and XML documents, and it provides a simple and intuitive API for defining and running web scraping operations.

Wombat's main feature is a small declarative DSL for extracting structured data from HTML pages. You name each field you want and give it a CSS or XPath selector, optionally nesting fields or marking them as lists, and Wombat applies those rules to the page's HTML and returns the results as a nested hash. This keeps extraction code short, even for pages with complex markup.
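
The rule-based approach can be sketched in Python using only the standard library. This is an analogy, not Wombat's API: `PAGE`, `RULES`, and `apply_rules` are hypothetical names, the sample markup is made up, and `xml.etree.ElementTree` only handles well-formed markup with a limited XPath subset, so it would not cope with real-world HTML the way Nokogiri does.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (an assumption for this sketch).
PAGE = """
<html>
  <body>
    <h1>How people build software</h1>
    <p class="alt-lead">Millions of developers use GitHub.</p>
    <div class="one-fourth"><h4>For everything you build</h4></div>
    <div class="one-fourth"><h4>A better way to work</h4></div>
  </body>
</html>
"""

# Declarative rules: field name -> selector, plus an optional list flag.
RULES = {
    "headline": {"path": ".//h1"},
    "subheading": {"path": ".//p[@class='alt-lead']"},
    "what_is": {"path": ".//div[@class='one-fourth']/h4", "list": True},
}


def apply_rules(markup, rules):
    """Apply every rule to the parsed page and collect the results."""
    root = ET.fromstring(markup)
    result = {}
    for name, rule in rules.items():
        matches = [(el.text or "").strip() for el in root.findall(rule["path"])]
        if rule.get("list"):
            # List rules keep every match.
            result[name] = matches
        else:
            # Scalar rules keep the first match, or None if nothing matched.
            result[name] = matches[0] if matches else None
    return result


print(apply_rules(PAGE, RULES))
```

The point of the design is that the rules are data: adding a field means adding one entry to the mapping rather than writing new traversal code.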

In addition to its data extraction capabilities, Wombat also provides a variety of other features that can simplify the web scraping process. It can automatically follow links and scrape multiple pages, it can handle pagination and AJAX requests, and it can handle cookies and authentication. It also provides built-in support for parallelism and queueing to speed up the scraping process.

Highlights


ai-powered, popular

Example Use


```python
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List


# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
```ruby
require 'wombat'

Wombat.crawl do
  base_url "https://www.github.com"
  path "/"

  headline xpath: "//h1"
  subheading css: "p.alt-lead"

  what_is({ css: ".one-fourth h4" }, :list)

  links do
    explore xpath: '/html/body/header/div/div/nav[1]/a[4]' do |e|
      e.gsub(/Explore/, "Love")
    end

    features css: '.nav-item-opensource'
    business css: '.nav-item-business'
  end
end
```

will result in:

```ruby
{
  "headline"=>"How people build software",
  "subheading"=>"Millions of developers use GitHub to build personal projects, support their businesses, and work together on open source technologies.",
  "what_is"=>[
    "For everything you build",
    "A better way to work",
    "Millions of projects",
    "One platform, from start to finish"
  ],
  "links"=>{
    "explore"=>"Love",
    "features"=>"Open source",
    "business"=>"Business"
  }
}
```
