node-crawlervsgerapy
node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.
Features:
- Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
- Configurable pool size and retries,
- Control rate limit,
- Priority queue of requests,
- forceUTF8 mode to let crawler deal for you with charset detection and conversion,
- Compatible with 4.x or newer version.
- Http2 support
- Proxy support
Gerapy is a Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.
It is built on top of the Scrapy framework and provides a simple and easy-to-use interface for performing web scraping tasks. Gerapy also includes features such as support for scheduling and distributed crawling, as well as a built-in web-based dashboard for monitoring and managing scraping tasks. Additionally, Gerapy is designed to be highly extensible, allowing users to easily create custom plugins and integrations.
Overall, Gerapy is a useful tool for those looking to automate web scraping tasks and extract data from websites.
Example Use
const Crawler = require('crawler');
const c = new Crawler({
maxConnections: 10,
// This will be called for each crawled page
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($('title').text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '<p>This is a <strong>test</strong></p>'
}]);