
node-crawler vs colly

node-crawler: MIT license | 30 | 6 | 6,790 stars | 15.3 thousand downloads (month) | Sep 10 2012 | 2.0.2 (2025-05-28 09:36:01)
colly: 25,231 stars | 7 | 187 | Apache-2.0 license | May 14 2018 | v2.2.0 (2025-03-27 10:47:28)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.

Features:

  • Server-side DOM and automatic jQuery insertion with Cheerio (default) or JSDOM
  • Configurable pool size and retries
  • Rate limit control
  • Priority queue of requests
  • forceUTF8 mode, which lets the crawler handle charset detection and conversion for you
  • Compatible with Node.js 4.x or newer
  • HTTP/2 support
  • Proxy support
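
As a rough illustration of how the pool size, retry, rate limit, priority and forceUTF8 features listed above might be expressed, here is a minimal sketch. The option names (`maxConnections`, `retries`, `rateLimit`, `forceUTF8`, `priority`) are written as I understand them from the node-crawler README and should be checked against the installed version; the URLs and the `inServiceOrder` helper are hypothetical. The snippet only builds plain objects, so it runs without the package installed.

```javascript
// Assumed option names, based on the node-crawler README; verify them
// against the version you have installed before relying on them.
const crawlerOptions = {
  maxConnections: 10, // pool size: at most 10 requests in flight
  retries: 3,         // retry a failed request up to 3 times
  rateLimit: 1000,    // minimum gap between requests, in milliseconds
  forceUTF8: true,    // let the crawler detect and convert charsets
};

// Hypothetical URLs. A lower `priority` number is served first.
const tasks = [
  { uri: 'https://example.com/archive', priority: 8 },
  { uri: 'https://example.com/', priority: 1 },
  { uri: 'https://example.com/news', priority: 5 },
];

// Hypothetical helper: mimics the order in which a priority queue would
// hand these tasks out. It copies the array, so `tasks` is left untouched.
function inServiceOrder(items) {
  return [...items].sort((a, b) => a.priority - b.priority);
}

console.log(crawlerOptions);
console.log(inServiceOrder(tasks).map((t) => t.uri));
```

In real use, `crawlerOptions` would be passed to `new Crawler(...)` and the tasks to the queue, as in the example further down the page. As far as I know, setting a non-zero `rateLimit` makes the crawler use a single connection, so combining it with a large `maxConnections` may not have the effect it suggests.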

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

  • Concurrent scraping with a simple API
  • Automatic handling of cookies and sessions
  • Automatic handling of redirects
  • Support for parsing HTML and XML
  • Access to raw response bodies, such as JSON or binary data, through response callbacks
  • Support for custom storage backends (for cookies and visited URLs)

Colly also provides several optional features, such as custom User-Agent strings, delays between requests, rate limiting, and proxy usage. Colly does not execute JavaScript itself, so pages that render their content in the browser need a separate headless-browser tool.

Colly's API is simple, so it is easy to get started with basic web scraping tasks. It suits medium to large scraping jobs and covers a wide range of use cases, such as data mining and content extraction.

Colly is also often used together with goquery, a library that allows you to make jQuery-like queries on HTML documents, which makes parsing the HTML easier.

Highlights


popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


```javascript
const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default,
            // a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/', 'http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a test</p>'
}]);
```
```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
```


