node-crawlervscrwlr-crawler
node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.
Features:
- Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
- Configurable pool size and retries,
- Control rate limit,
- Priority queue of requests,
- forceUTF8 mode to let crawler deal for you with charset detection and conversion,
- Compatible with 4.x or newer version.
- Http2 support
- Proxy support
This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.
Some features: - Crawler Politeness innocent (respecting robots.txt, throttling,...) - Load URLs using - a (PSR-18) HTTP client (default is of course Guzzle) - or a headless browser (chrome) to get source after Javascript execution - Get absolute links from HTML documents link - Get sitemaps from robots.txt and get all URLs from those sitemaps - Crawl (load) all pages of a website spider - Use cookies (or don't) cookie - Use any HTTP methods (GET, POST,...) and send any headers or body - Iterate over paginated list pages repeat - Extract data from: - HTML and also XML (using CSS selectors or XPath queries) - JSON (using dot notation) - CSV (map columns) - Extract schema.org structured data in JSON-LD format from HTML documents - Keep memory usage low by using PHP Generators muscle - Cache HTTP responses during development, so you don't have to load pages again and again after every code change - Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
Example Use
const Crawler = require('crawler');
const c = new Crawler({
maxConnections: 10,
// This will be called for each crawled page
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($('title').text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '<p>This is a <strong>test</strong></p>'
}]);
<?php
require_once 'vendor/autoload.php';
use Crwlr\Crawler;
$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);
// more links can be followed:
$crawler->followLinks();
// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();