colly vs node-crawler
Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.
Colly supports:
- Concurrent scraping with a simple API
- Automatic handling of cookies and sessions
- Automatic handling of redirects
- Support for parsing HTML and XML
- Support for parsing JSON and binary data
- Pluggable storage backends for cookies and visited URLs (in-memory by default)
Colly does not execute JavaScript. Pages that build their content client-side need a headless browser in addition to Colly.
Colly also provides several optional features, such as custom User-Agent strings, delays between requests, rate limiting, and proxy support.
Colly's API is simple, so it is easy to get started with basic web scraping tasks. It suits medium to large crawls and covers a wide range of use cases, such as data mining and content extraction.
Colly is often used together with goquery, a library that lets you run jQuery-like queries on HTML documents, which makes parsing the HTML easier.
node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.
Features:
- Server-side DOM and automatic jQuery insertion with Cheerio (default) or JSDOM
- Configurable pool size and retries
- Rate limit control
- Priority queue of requests
- forceUTF8 mode, which lets the crawler handle charset detection and conversion for you
- Compatible with Node.js 4.x or newer
- HTTP/2 support
- Proxy support
Example Use
Colly (Go):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
node-crawler (JavaScript):

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default,
            // a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/', 'http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);