node-crawlervsscrapy
node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.
Features:
- Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
- Configurable pool size and retries,
- Control rate limit,
- Priority queue of requests,
- forceUTF8 mode to let crawler deal for you with charset detection and conversion,
- Compatible with 4.x or newer version.
- Http2 support
- Proxy support
Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.
Scrapy provides:
- A built-in way to follow links and extract data from multiple pages (crawling)
- Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.
Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
It also comes with a built-in mechanism for handling common web scraping problems, such as:
- handling HTTP errors
- handling broken links
Scrapy also provide these features:
- Support for storing scraped data in various formats, such as CSV, JSON, and XML.
- Built-in support for selecting and extracting data using XPath or CSS selectors (through
parsel
). - Built-in support for handling common web scraping problems (like deduplication and url filtering).
- Ability to easily extend its functionality using middlewares.
- Ability to easily extend output processing using pipelines.
Highlights
popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale
Example Use
const Crawler = require('crawler');
const c = new Crawler({
maxConnections: 10,
// This will be called for each crawled page
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($('title').text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '<p>This is a <strong>test</strong></p>'
}]);