node-crawler: MIT license, 6,609 stars, 39.9k downloads/month, created Sep 10 2012, latest release 1.5.0 (3 months ago)
php-spider: MIT license, 1,320 stars, 65 downloads/month, created Mar 16 2013, latest release v0.7.2 (3 months ago)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.


  • Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM
  • Configurable pool size and retries
  • Rate limiting
  • Priority queue of requests (rate limits and priorities are shown in the sketch after this list)
  • forceUTF8 mode to let the crawler handle charset detection and conversion for you
  • Compatible with Node.js 4.x or newer
  • HTTP/2 support
  • Proxy support
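
The rate limit and priority options combine naturally. Below is a minimal sketch, assuming node-crawler 1.x option names; the URLs are placeholders. Setting rateLimit funnels requests through a single connection with a delay between them, and each queued task can carry its own priority.

const Crawler = require('crawler');

const polite = new Crawler({
    rateLimit: 1000,  // minimum ms between requests; implies maxConnections: 1
    retries: 2,       // retry a failed request twice before giving up
    forceUTF8: true,  // detect and convert charsets automatically
    callback: (error, res, done) => {
        if (error) {
            console.error(error);
        } else {
            console.log(res.statusCode, res.options.uri);
        }
        done();
    }
});

// Lower priority numbers are dequeued first (the default is 5)
polite.queue({ uri: 'https://example.com/important', priority: 1 });
polite.queue({ uri: 'https://example.com/whenever', priority: 10 });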

php-spider is a PHP library for web crawling and scraping. It lets developers navigate websites and extract data by discovering and following links, with configurable traversal, filtering, and persistence.

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content); see the sketch below
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy

This spider does not execute JavaScript, so content rendered client-side will not be crawled.
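
To make the discoverer and filter hooks concrete, here is a minimal sketch, assuming the discoverer and prefetch filter classes the library ships; the seed host is a placeholder. Prefetch filters decide which discovered URIs get queued at all:

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\CssSelectorDiscoverer;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;
use VDB\Spider\Filter\Prefetch\UriWithHashFragmentFilter;
use VDB\Spider\Filter\Prefetch\UriWithQueryStringFilter;

$seed = 'https://example.com/';
$spider = new Spider($seed);

// Discover links with a CSS selector instead of an XPath expression
$spider->getDiscovererSet()->set(new CssSelectorDiscoverer('a'));

// Stay on the seed's host, and skip URIs with hash fragments or query strings
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed)));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());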

Example Use

node-crawler:

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default,
            // a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('');

// Queue a list of URLs
c.queue(['', '']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: '',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
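
Note that every callback receives a done callback as its third argument; the crawler does not release the connection slot until done() is invoked, so it must be called even when the handler finishes synchronously.
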
php-spider:

use Example\StatsHandler;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use Symfony\Contracts\EventDispatcher\Event;
use VDB\Spider\Event\SpiderEvents;
use VDB\Spider\Spider;


// Create Spider
$spider = new Spider('');

// Add a URI discoverer. Without it, the spider does nothing. In this case, we want <a> tags from a certain <div>
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

// Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Let's add something to enable us to stop the script
$spider->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
    function (Event $event) {
        echo "\nCrawl aborted by user.\n";
        exit();
    }
);

// Add a listener to collect stats to the Spider and the QueueManager.
// There are more components that dispatch events you can use.
$statsHandler = new StatsHandler();
$spider->getDispatcher()->addSubscriber($statsHandler);
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);

// Execute crawl
$spider->crawl();

// Build a report
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());

// Finally we could do some processing on the downloaded resources
// In this example, we will echo the title of all resources
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}

Alternatives / Similar