node-crawlervsscrapy
node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.
Features:
- Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
- Configurable pool size and retries,
- Control rate limit,
- Priority queue of requests,
- forceUTF8 mode to let crawler deal for you with charset detection and conversion,
- Compatible with 4.x or newer version.
- Http2 support
- Proxy support
Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.
Scrapy provides:
- A built-in way to follow links and extract data from multiple pages (crawling)
- Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.
Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
It also comes with a built-in mechanism for handling common web scraping problems, such as:
- handling HTTP errors
- handling broken links
Scrapy also provide these features:
- Support for storing scraped data in various formats, such as CSV, JSON, and XML.
- Built-in support for selecting and extracting data using XPath or CSS selectors (through
parsel). - Built-in support for handling common web scraping problems (like deduplication and url filtering).
- Ability to easily extend its functionality using middlewares.
- Ability to easily extend output processing using pipelines.
Highlights
popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale
Example Use
```javascript
const Crawler = require('crawler');
const c = new Crawler({
maxConnections: 10,
// This will be called for each crawled page
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($('title').text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '
This is a test
' }]); ```Alternatives / Similar
katana
new
crawl4ai
new
scrapling
new
crawlee
new
mechanize
new
scrapegraphai
new
botasaurus
new
goutte
new
kimurai
new
firecrawl
new
katana
new
crawl4ai
new
scrapling
new
crawlee
new
mechanize
new
scrapegraphai
new
botasaurus
new
goutte
new
kimurai
new
firecrawl
new