
colly vs gocrawl

colly: 23,241 GitHub stars | Apache-2.0 license | latest release v2.1.0
gocrawl: 2,040 GitHub stars | BSD-3-Clause license | last release Nov 20 2016

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

  • Concurrent scraping with a simple API
  • Automatic handling of cookies and sessions
  • Automatic handling of redirects
  • Support for parsing HTML and XML
  • Support for parsing JSON and binary data
  • Support for custom storage (e.g. scraping results to a database)
  • Caching of responses for faster repeat crawls

Colly also provides several optional features, such as custom user agents, delays between requests, rate limiting, and proxy switching.
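A minimal sketch of how these options are typically wired up with collector options, a LimitRule, and the proxy switcher (the proxy addresses and target URL are placeholders):

package main

import (
  "time"

  "github.com/gocolly/colly/v2"
  "github.com/gocolly/colly/v2/proxy"
)

func main() {
  c := colly.NewCollector(
    colly.UserAgent("my-scraper/1.0"), // custom user agent
    colly.Async(true),                 // run requests concurrently
  )

  // Rate-limit: at most 2 parallel requests per domain, with a random delay.
  c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    RandomDelay: 2 * time.Second,
  })

  // Rotate requests across a list of proxies (placeholder addresses).
  if p, err := proxy.RoundRobinProxySwitcher(
    "socks5://127.0.0.1:1337",
    "http://127.0.0.1:8080",
  ); err == nil {
    c.SetProxyFunc(p)
  }

  c.Visit("https://example.com/")
  c.Wait() // required in async mode
}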

Colly's API is quite simple, and it is easy to get started with basic web scraping tasks. It's a good choice for moderate to large scraping jobs and is useful for a wide range of use cases, such as data mining and content extraction.

Additionally, you can pair it with goquery, a library that lets you run jQuery-like queries on HTML documents; it is often used together with Colly to simplify HTML parsing.
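Inside a Colly callback, the matched element exposes its underlying goquery selection via e.DOM, so jQuery-style traversal can be chained directly. A minimal sketch (the selectors and URL are placeholders):

package main

import (
  "fmt"

  "github.com/PuerkitoBio/goquery"
  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector()

  // e.DOM is a *goquery.Selection, so jQuery-style queries work on it.
  c.OnHTML("article", func(e *colly.HTMLElement) {
    title := e.DOM.Find("h2 a").First().Text()
    tags := e.DOM.Find(".tag").Map(func(_ int, s *goquery.Selection) string {
      return s.Text()
    })
    fmt.Println(title, tags)
  })

  c.Visit("https://example.com/")
}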

Gocrawl is a polite, slim and concurrent web crawler library written in Go. It is designed to be simple and easy to use, while still providing a high degree of flexibility and control over the crawling process.

One of the key features of Gocrawl is its politeness: it obeys a website's robots.txt file and respects any crawl-delay specified there. It can also take a page's last-modified date into account to avoid recrawling unchanged pages, which reduces load on the website and helps avoid potential legal issues. Gocrawl is also highly concurrent, crawling many pages in parallel to shorten the time a crawl takes.
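These politeness knobs are configured through gocrawl's Options struct. A minimal configuration sketch (the values and start URL are placeholders):

package main

import (
  "time"

  "github.com/PuerkitoBio/gocrawl"
)

func main() {
  opts := gocrawl.NewOptions(new(gocrawl.DefaultExtender))
  opts.RobotUserAgent = "MyBot"     // name matched against robots.txt rules
  opts.CrawlDelay = 5 * time.Second // used when robots.txt sets no Crawl-delay
  opts.HeadBeforeGet = true         // HEAD first, so an extender can inspect Last-Modified
  opts.SameHostOnly = true          // do not follow links to other hosts
  opts.MaxVisits = 10               // stop after a fixed number of pages

  gocrawl.NewCrawlerWithOptions(opts).Run("https://example.com/")
}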

The library also offers a high degree of flexibility in customizing the crawling process. You can register custom callbacks and handlers for different kinds of pages, such as error pages and redirects, and process them however your application requires. Additionally, Gocrawl provides features such as cookie support, a configurable user agent, automatic link discovery, and sitemap detection.
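As a sketch of that callback model, you can embed gocrawl's DefaultExtender and override only the hooks you need, for example to log fetch errors and newly enqueued URLs (the logging logic here is illustrative):

package main

import (
  "log"

  "github.com/PuerkitoBio/gocrawl"
)

// LoggingExtender keeps gocrawl's default behaviour for every hook
// except the two overridden below.
type LoggingExtender struct {
  gocrawl.DefaultExtender
}

// Error is invoked when fetching or processing a page fails.
func (x *LoggingExtender) Error(err *gocrawl.CrawlError) {
  log.Println("crawl error:", err)
}

// Enqueued is invoked each time a URL passes Filter and is queued.
func (x *LoggingExtender) Enqueued(ctx *gocrawl.URLContext) {
  log.Println("enqueued:", ctx.URL())
}

func main() {
  opts := gocrawl.NewOptions(new(LoggingExtender))
  opts.RobotUserAgent = "MyBot"
  opts.MaxVisits = 5
  gocrawl.NewCrawlerWithOptions(opts).Run("https://example.com/")
}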

Highlights


popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  // Instantiate default collector
  c := colly.NewCollector(
    // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
    colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
  )

  // On every a element which has href attribute call callback
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    // Print link
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    // Visit link found on page
    // Only those links are visited which are in AllowedDomains
    c.Visit(e.Request.AbsoluteURL(link))
  })

  // Before making a request print "Visiting ..."
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  // Start scraping on https://hackerspaces.org
  c.Visit("https://hackerspaces.org/")
}
// The gocrawl example below is adapted from the gocrawl README.
package main

import (
    "net/http"
    "regexp"
    "time"

    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
)

// Only enqueue the root and paths beginning with an "a"
var rxOk = regexp.MustCompile(`http://duckduckgo\.com(/a.*)?$`)

// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
    gocrawl.DefaultExtender // Will use the default implementation of all but Visit and Filter
}

// Override Visit for our need.
func (x *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    // Use the goquery document or res.Body to manipulate the data
    // ...

    // Return nil and true - let gocrawl find the links
    return nil, true
}

// Override Filter for our need.
func (x *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}

func ExampleCrawl() {
    // Set custom options
    opts := gocrawl.NewOptions(new(ExampleExtender))

    // should always set your robot name so that it looks for the most
    // specific rules possible in robots.txt.
    opts.RobotUserAgent = "Example"
    // and reflect that in the user-agent string used to make requests,
    // ideally with a link so site owners can contact you if there's an issue
    opts.UserAgent = "Mozilla/5.0 (compatible; Example/1.0; +http://example.com)"

    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogAll

    // Play nice with ddgo when running the test!
    opts.MaxVisits = 2

    // Create crawler and start at root of duckduckgo
    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("https://duckduckgo.com/")

    // Remove "x" before Output: to activate the example (will run on go test)

    // xOutput: voluntarily fail to see log output
}

func main() {
    ExampleCrawl()
}
