
scrapegraphai vs gocrawl

MIT 4 17 23,278
59.6 thousand (month) Jan 15 2024 1.76.0(2026-04-09 09:41:03 ago)
2,053 2 6 BSD-3-Clause
Nov 20 2016 58.1 thousand (month) (2021-05-19 15:14:49 ago)

ScrapeGraphAI is a Python library that uses large language models (LLMs) to create web scraping pipelines automatically. Instead of writing CSS selectors or XPath expressions, you describe what data you want in natural language and provide a Pydantic schema — the library handles the rest.

Key features include:

  • Natural language extraction: Describe what you want to extract in plain English (e.g., "Extract all product names and prices") and the LLM figures out how to find and extract the data.
  • Pydantic schema output: Define the expected output structure using Pydantic models for type-safe, validated extraction results.
  • Graph-based pipeline: Built on a directed graph architecture where each node performs a specific task (fetching, parsing, extracting, merging). This makes pipelines modular and debuggable.
  • Multiple graph types: SmartScraperGraph (single page), SearchGraph (search + scrape), SpeechGraph (audio output), and more specialized pipelines.
  • Multiple LLM providers: Works with OpenAI, Anthropic, Google, Groq, local models via Ollama, and more.
  • HTML and JSON support: Can extract data from both HTML pages and JSON API responses.
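
To make the graph-based pipeline idea concrete, here is a minimal sketch of a fetch, parse, extract, merge chain in plain Python. It is an illustration of the concept only, not ScrapeGraphAI's internal API: the node functions, the `run_pipeline` helper, and the inline HTML are all hypothetical, and the "extract" step is a simple string split standing in for the LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical node type: takes the shared state and returns an updated copy.
State = Dict[str, object]
Node = Callable[[State], State]


def fetch(state: State) -> State:
    # Stand-in for an HTTP fetch; inline HTML keeps the sketch runnable offline.
    return {**state, "html": "<li>Pen 1.50</li><li>Ink 4.00</li>"}


def parse(state: State) -> State:
    # Reduce the markup to plain-text chunks, one per list item.
    html = str(state["html"])
    chunks = [c.replace("</li>", "") for c in html.split("<li>") if c]
    return {**state, "chunks": chunks}


def extract(state: State) -> State:
    # Stand-in for the LLM step: turn each chunk into a structured record.
    records = []
    for chunk in state["chunks"]:
        name, price = chunk.rsplit(" ", 1)
        records.append({"name": name, "price": float(price)})
    return {**state, "records": records}


def merge(state: State) -> State:
    # Combine the per-chunk records into the final result.
    return {**state, "result": {"products": state["records"]}}


def run_pipeline(nodes: List[Node], state: State) -> State:
    # A linear chain is the simplest directed graph: each node feeds the next.
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline([fetch, parse, extract, merge], {"source": "inline"})
print(final_state["result"])
```

Because every node reads and writes the same state dictionary, a single step can be swapped out or inspected on its own, which is the modularity and debuggability the list above refers to.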

ScrapeGraphAI is particularly useful for rapid prototyping of scrapers and for extracting data from pages with complex or frequently changing layouts where traditional selectors would be brittle.

Gocrawl is a polite, slim and concurrent web crawler library written in Go. It is designed to be simple and easy to use, while still providing a high degree of flexibility and control over the crawling process.

One of Gocrawl's key features is politeness: it obeys a website's robots.txt file and respects any crawl-delay specified there. It also takes into account the website's last modified date, if any, to avoid recrawling the same page. Together these reduce load on the website and lower the risk of legal issues. Gocrawl is also highly concurrent, so it can crawl large numbers of pages in parallel and finish large crawls sooner.
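
As a rough, language-agnostic illustration of what honoring robots.txt and crawl-delay involves, the sketch below uses Python's standard `urllib.robotparser` module. It is not Gocrawl's implementation: the robots.txt content, the `ExampleBot` agent name, and the URLs are assumptions made up for the example.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "ExampleBot"  # hypothetical robot name

# Check whether each URL may be fetched under the rules above.
allowed = parser.can_fetch(AGENT, "https://example.com/products")
blocked = parser.can_fetch(AGENT, "https://example.com/private/data")

# Seconds a polite crawler should wait between requests to this host.
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

A polite crawler would skip any URL for which `can_fetch` returns `False` and sleep for at least `delay` seconds between requests to the same host.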

The library is also highly customizable. You can register custom callbacks and handlers for different kinds of responses, such as error pages and redirects, and process each page however your application requires. Gocrawl additionally supports cookies, custom user-agent strings, automatic link detection, and automatic sitemap detection.

Highlights


ai-powered, popular

Example Use


```python
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List


# Define the output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Customer rating out of 5")


class ProductList(BaseModel):
    products: List[Product]


# Create a scraping graph with natural language instruction
graph = SmartScraperGraph(
    prompt="Extract all products with their names, prices, and ratings",
    source="https://example.com/products",
    schema=ProductList,
    config={
        "llm": {
            "model": "openai/gpt-4o",
            "api_key": "YOUR_API_KEY",
        },
    },
)

# Run the graph
result = graph.run()
for product in result["products"]:
    print(f"{product['name']}: ${product['price']} ({product['rating']}/5)")
```
```go
import (
	"net/http"
	"regexp"
	"time"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

// Only enqueue the root and paths beginning with an "a".
// Accept both http and https so the https seed URL used below passes the filter.
var rxOk = regexp.MustCompile(`https?://duckduckgo\.com(/a.*)?$`)

// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
	gocrawl.DefaultExtender // Will use the default implementation of all but Visit and Filter
}

// Override Visit for our need.
func (x *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
	// Use the goquery document or res.Body to manipulate the data
	// ...

	// Return nil and true - let gocrawl find the links
	return nil, true
}

// Override Filter for our need.
func (x *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
	return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}

func ExampleCrawl() {
	// Set custom options
	opts := gocrawl.NewOptions(new(ExampleExtender))

	// should always set your robot name so that it looks for the most
	// specific rules possible in robots.txt.
	opts.RobotUserAgent = "Example"
	// and reflect that in the user-agent string used to make requests,
	// ideally with a link so site owners can contact you if there's an issue
	opts.UserAgent = "Mozilla/5.0 (compatible; Example/1.0; +http://example.com)"

	opts.CrawlDelay = 1 * time.Second
	opts.LogFlags = gocrawl.LogAll

	// Play nice with ddgo when running the test!
	opts.MaxVisits = 2

	// Create crawler and start at root of duckduckgo
	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("https://duckduckgo.com/")

	// Remove "x" before Output: to activate the example (will run on go test)

	// xOutput: voluntarily fail to see log output
}
```


