gerapy vs colly

MIT 72 4 3,365

1.7 thousand (month) Jul 04 2017 0.9.13 (2 years ago)

23,747 5 200 Apache-2.0

May 14 2018 v2.1.0 (5 years ago)

Gerapy is a distributed crawler management framework based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.

It is built on top of the Scrapy framework and provides a web-based dashboard for running, monitoring, and managing scraping tasks, with support for scheduling and distributed crawling. Gerapy is also designed to be extensible, so users can add custom plugins and integrations.

Overall, Gerapy is a useful tool for those looking to automate web scraping tasks and extract data from websites.

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

Concurrent scraping with a simple API
Automatic handling of cookies and sessions
Automatic handling of redirects
Support for parsing HTML and XML
Support for parsing JSON and binary data
Support for custom storage backends (e.g. keeping visited URLs and cookies in a database)
Colly does not render JavaScript itself; pages that rely on JavaScript need a separate headless browser.

Colly also provides several optional features, such as custom User-Agent strings, delays between requests, rate limiting, and proxy usage.
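To make the ideas of concurrent scraping, per-request delay, and a parallelism cap concrete, here is a minimal sketch that uses only the Go standard library. It is not Colly's API: Colly configures these on the collector, while the names `crawlAll` and `fetchFunc` below are hypothetical and the fetch step is a stub so the sketch runs without network access.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFunc stands in for an HTTP request. It is a hypothetical hook so the
// sketch can run without touching the network.
type fetchFunc func(url string) string

// crawlAll calls fetch once per URL, runs at most parallelism calls at the
// same time, and sleeps for delay before each call.
func crawlAll(urls []string, parallelism int, delay time.Duration, fetch fetchFunc) map[string]string {
	if parallelism < 1 {
		parallelism = 1 // a zero-capacity semaphore would block forever
	}
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, parallelism) // counting semaphore

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot when done
			time.Sleep(delay)        // politeness delay before the request
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}
	out := crawlAll(urls, 2, 10*time.Millisecond, func(u string) string {
		return "body of " + u // stubbed response
	})
	fmt.Println(len(out), "pages fetched")
}
```

The buffered channel caps how many fetches are in flight, and the sleep spaces requests out; a real scraper would typically apply such limits per domain rather than globally.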

Colly's API is simple, so it is easy to get started with basic web scraping tasks. It is a good choice for scraping medium to large sites and is useful for a wide range of use cases, such as data mining and content extraction.

Colly is also often used together with Goquery, a library that lets you run jQuery-like queries on HTML documents, which makes parsing HTML easier.

Highlights

popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  // Instantiate default collector
  c := colly.NewCollector(
    // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
    colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
  )

  // On every a element which has href attribute call callback
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    // Print link
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    // Visit link found on page
    // Only those links are visited which are in AllowedDomains
    c.Visit(e.Request.AbsoluteURL(link))
  })

  // Before making a request print "Visiting ..."
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  // Start scraping on https://hackerspaces.org
  c.Visit("https://hackerspaces.org/")
}
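The example above turns each href into an absolute URL and only follows links on the allowed domains. As a rough illustration of that step, the following standard-library sketch resolves a link against the page URL and checks its hostname. The function name `resolveAllowed` is hypothetical, and the exact-hostname comparison is an assumption that may differ from Colly's own check.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAllowed resolves href against pageURL and reports whether the
// resulting hostname exactly matches one of the allowed domains.
func resolveAllowed(pageURL, href string, allowed []string) (string, bool) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref) // standard relative-reference resolution
	for _, domain := range allowed {
		if abs.Hostname() == domain {
			return abs.String(), true
		}
	}
	return abs.String(), false
}

func main() {
	allowed := []string{"hackerspaces.org", "wiki.hackerspaces.org"}
	links := []string{"/events", "https://wiki.hackerspaces.org/Berlin", "https://example.com/"}
	for _, href := range links {
		abs, ok := resolveAllowed("https://hackerspaces.org/", href, allowed)
		fmt.Println(abs, ok)
	}
}
```

Links that resolve to a host outside the list are returned with `false`, which corresponds to the links the collector would skip.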

Alternatives / Similar

scrapydweb (3,218)
scrapy (54,211)
scrapyd (2,980)
colly (23,747)
pholcus (7,580)
geziyor (2,667)
dataflowkit (676)
rvest (1,498)
gocrawl (2,039)
ferret (5,716)
node-crawler (6,733)
panther (2,977)
autoscraper (6,638)
gracy (247)
spidr (813)
wombat (1,316)
splash (4,122)
ruia (1,754)
photon (11,149)
ralger (156)
roach (1,384)
dude (428)
ayakashi (213)
phpscraper (554)
php-spider (1,335)
crwlr-crawler (356)