Skip to content

collyvsralger

Apache-2.0 181 5 22,036
May 14 2018 v2.1.0(3 years ago)
153 1 3 MIT
2.2.4(3 years ago) Dec 22 2019 1.2 thousand (month)

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

  • Concurrent scraping with a simple API
  • Automatic handling of cookies and sessions
  • Automatic handling of redirects
  • Support for parsing HTML and XML
  • Support for parsing JSON and binary data
  • Support for custom storage (e.g. scraping results to a database)
  • Simple JavaScript rendering with Colly's built-in rendering engine.

Colly also provides several optional features, such as support for user-agents, delay between requests, rate-limiting and proxy usage.

Colly's API is quite simple, and it is easy to get started with basic web scraping tasks. It's a good choice for scraping moderate to heavy sites, and it can be useful for a wide range of use cases, such as data mining, content extraction, and more.

Additionally, you can use it together with Goquery, a library that allow you to make jquery like queries on HTML documents and it is often used together with Colly to ease the way of parsing the HTML.

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Highlights


popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale

Example Use


package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  // Instantiate default collector
  c := colly.NewCollector(
    // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
    colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
  )

  // On every a element which has href attribute call callback
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    // Print link
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    // Visit link found on page
    // Only those links are visited which are in AllowedDomains
    c.Visit(e.Request.AbsoluteURL(link))
  })

  // Before making a request print "Visiting ..."
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  // Start scraping on https://hackerspaces.org
  c.Visit("https://hackerspaces.org/")
}
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#>  [1] "Harvard University"
#>  [2] "Stanford University"
#>  [3] "University of Cambridge"
#>  [4] "Massachusetts Institute of Technology (MIT)"
#>  [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
#

# ralger can automatically scrape tables:
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021

Alternatives / Similar