colly vs kimurai

Apache-2.0 187 7 25,231
May 14 2018 v2.2.0(2025-03-27 10:47:28 ago)
1,098 1 14 MIT
Aug 23 2018 2.4 thousand (month) 2.2.0(2026-01-27 17:36:19 ago)

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

  • Concurrent scraping with a simple API
  • Automatic handling of cookies and sessions
  • Automatic handling of redirects
  • Support for parsing HTML and XML
  • Support for parsing JSON and binary data
  • Pluggable storage backends for collector state such as visited URLs and cookies (in-memory by default)
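
Colly does not execute JavaScript. It works on the HTML the server returns, so pages that are rendered client-side need a headless browser (for example chromedp) alongside it.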

Colly also provides several optional features, such as custom User-Agent strings, delays between requests, rate limiting, and proxy support.

Colly's API is simple, and it is easy to get started with basic web scraping tasks. It handles moderate to large crawls well and suits a wide range of use cases, such as data mining and content extraction.

Colly is often used together with Goquery, a library that allows you to run jQuery-like queries on HTML documents, which makes parsing the HTML easier.

Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides a structured approach to building web scrapers with built-in support for multiple browser engines, session management, and data pipelines.

Key features include:

  • Multiple engine support: Can use different backends depending on the scraping needs: Mechanize for simple HTTP requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, and Poltergeist (PhantomJS) for lightweight rendering.
  • Scrapy-like architecture: Follows the spider pattern: define a spider class with start URLs and parsing methods, and the framework handles crawling, scheduling, and data collection.
  • Built-in data pipelines: Save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
  • Session management: Maintains browser sessions with automatic cookie handling and configurable delays between requests.
  • Request scheduling: Built-in request queue with configurable concurrency, delays, and retry logic.
  • CLI tools: Command-line tools for generating new spiders, running individual spiders, and managing scraping projects.

Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured scraping projects that need organization, multiple spiders, and data pipeline processing.
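The spider pattern mentioned above is independent of any one framework: a parse function returns scraped items plus follow-up URLs, and a scheduling loop owns the queue and the set of visited pages. The sketch below shows that loop in Go, using only the standard library and an in-memory fake site. It is not Kimurai's or Scrapy's API; every name in it (`Crawl`, `ParseFunc`, `fakeFetch`, `parseListing`, the `item:`/`next:` page format) is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is one scraped record.
type Item map[string]string

// Fetcher retrieves the body of a page.
type Fetcher func(url string) (string, error)

// ParseFunc extracts items from a page and returns the URLs to follow next.
type ParseFunc func(url, body string) (items []Item, next []string)

// Crawl is the "framework" part: a FIFO queue plus a visited set.
// The spider author only supplies start URLs and a parse function.
func Crawl(startURLs []string, fetch Fetcher, parse ParseFunc) []Item {
	queue := append([]string(nil), startURLs...)
	visited := make(map[string]bool)
	var collected []Item

	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		if visited[url] {
			continue // never fetch the same page twice
		}
		visited[url] = true

		body, err := fetch(url)
		if err != nil {
			continue // a real framework would retry or log here
		}
		items, next := parse(url, body)
		collected = append(collected, items...)
		queue = append(queue, next...)
	}
	return collected
}

// fakeSite stands in for a real website; page 2 links back to page 1
// to show that the visited set prevents an endless loop.
var fakeSite = map[string]string{
	"/products?page=1": "item:Widget\nitem:Gadget\nnext:/products?page=2",
	"/products?page=2": "item:Gizmo\nnext:/products?page=1",
}

// fakeFetch looks the page up in fakeSite instead of doing HTTP.
func fakeFetch(url string) (string, error) {
	body, ok := fakeSite[url]
	if !ok {
		return "", fmt.Errorf("not found: %s", url)
	}
	return body, nil
}

// parseListing plays the role of a spider's parse method.
func parseListing(url, body string) ([]Item, []string) {
	var items []Item
	var next []string
	for _, line := range strings.Split(body, "\n") {
		switch {
		case strings.HasPrefix(line, "item:"):
			items = append(items, Item{"name": strings.TrimPrefix(line, "item:"), "url": url})
		case strings.HasPrefix(line, "next:"):
			next = append(next, strings.TrimPrefix(line, "next:"))
		}
	}
	return items, next
}

func main() {
	for _, it := range Crawl([]string{"/products?page=1"}, fakeFetch, parseListing) {
		fmt.Println(it["name"], "from", it["url"])
	}
}
```

Frameworks in this style add engines, retries, delays, and output pipelines around the same loop, which is why the spider class itself can stay short.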

Note: Kimurai has not seen active development recently, but it remains a useful framework for Ruby scraping projects and is included as the most complete Ruby scraping framework available.

Highlights


colly: popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale
kimurai: middlewares, output-pipelines

Example Use


```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
```
```ruby
require 'kimurai'

class ProductSpider < Kimurai::Base
  @name = 'product_spider'
  @engine = :selenium_chrome  # or :mechanize for simple pages
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    # Extract product data from current page
    response.css('.product').each do |product|
      item = {
        name: product.css('.name').text.strip,
        price: product.css('.price').text.strip,
        url: absolute_url(product.at_css('a')['href'], base: url),
      }

      # Send item to the pipeline
      save_to "products.json", item, format: :json
    end

    # Follow pagination links
    if next_page = response.at_css('a.next-page')
      request_to :parse, url: absolute_url(next_page['href'], base: url)
    end
  end
end

# Run the spider
ProductSpider.crawl!
```
