Skip to content

node-crawlervsscrapydweb

MIT 34 5 6,609
39.9 thousand (month) Sep 10 2012 1.5.0(3 months ago)
2,983 1 59 GNU General Public License v3.0
1.5.0(a month ago) Sep 30 2018 1.6 thousand (month)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.

Features:

  • Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
  • Configurable pool size and retries,
  • Control rate limit,
  • Priority queue of requests,
  • forceUTF8 mode to let crawler deal for you with charset detection and conversion,
  • Compatible with 4.x or newer version.
  • Http2 support
  • Proxy support

ScrapydWeb is a web-based management tool for the Scrapyd service. It is built using the Python Flask framework and allows you to easily manage and monitor your Scrapy spider projects through a web interface.

ScrapydWeb allows you to view the status of your running spiders, view the logs of completed spiders, schedule new spider runs, and manage spider settings and configurations.

ScrapydWeb provides a simple way to manage your scraping tasks and allows you to schedule and run multiple spiders simultaneously. It also provides a user-friendly web interface that makes it easy to view the status of your spiders and monitor their progress.

You can install the package via pip by running pip install scrapydweb and then you can run the package by running scrapydweb command in your command prompt.

It will start a web server that you can access through your web browser at http://localhost:6800/ You will need to have Scrapyd running in order to use ScrapydWeb, Scrapyd is a service for running Scrapy spiders, it allows you to schedule spiders to run at regular intervals and also allows you to run spiders on remote machines.

Example Use


const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);

Alternatives / Similar