Skip to content

node-crawlervskimurai

MIT 30 6 6,790
15.3 thousand (month) Sep 10 2012 2.0.2(2025-05-28 09:36:01 ago)
1,098 1 14 MIT
Aug 23 2018 2.4 thousand (month) 2.2.0(2026-01-27 17:36:19 ago)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.

Features:

  • Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
  • Configurable pool size and retries,
  • Control rate limit,
  • Priority queue of requests,
  • forceUTF8 mode to let crawler deal for you with charset detection and conversion,
  • Compatible with 4.x or newer version.
  • Http2 support
  • Proxy support

Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides a structured approach to building web scrapers with built-in support for multiple browser engines, session management, and data pipelines.

Key features include:

  • Multiple engine support Can use different backends depending on the scraping needs: Mechanize for simple HTTP requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, and Poltergeist (PhantomJS) for lightweight rendering.
  • Scrapy-like architecture Follows the spider pattern: define a spider class with start URLs and parsing methods, and the framework handles crawling, scheduling, and data collection.
  • Built-in data pipelines Save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
  • Session management Maintains browser sessions with automatic cookie handling and configurable delays between requests.
  • Request scheduling Built-in request queue with configurable concurrency, delays, and retry logic.
  • CLI tools Command-line tools for generating new spiders, running individual spiders, and managing scraping projects.

Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured scraping projects that need organization, multiple spiders, and data pipeline processing.

Note: Kimurai has not seen active development recently, but it remains a useful framework for Ruby scraping projects and is included as the most complete Ruby scraping framework available.

Highlights


middlewaresoutput-pipelines

Example Use


```javascript const Crawler = require('crawler'); const c = new Crawler({ maxConnections: 10, // This will be called for each crawled page callback: (error, res, done) => { if (error) { console.log(error); } else { const $ = res.$; // $ is Cheerio by default //a lean implementation of core jQuery designed specifically for the server console.log($('title').text()); } done(); } }); // Queue just one URL, with default callback c.queue('http://www.amazon.com'); // Queue a list of URLs c.queue(['http://www.google.com/','http://www.yahoo.com']); // Queue URLs with custom callbacks & parameters c.queue([{ uri: 'http://parishackers.org/', jQuery: false, // The global callback won't be called callback: (error, res, done) => { if (error) { console.log(error); } else { console.log('Grabbed', res.body.length, 'bytes'); } done(); } }]); // Queue some HTML code directly without grabbing (mostly for tests) c.queue([{ html: '

This is a test

' }]); ```
```ruby require 'kimurai' class ProductSpider < Kimurai::Base @name = 'product_spider' @engine = :selenium_chrome # or :mechanize for simple pages @start_urls = ['https://example.com/products'] def parse(response, url:, data: {}) # Extract product data from current page response.css('.product').each do |product| item = { name: product.css('.name').text.strip, price: product.css('.price').text.strip, url: absolute_url(product.at_css('a')['href'], base: url), } # Send item to the pipeline save_to "products.json", item, format: :json end # Follow pagination links if next_page = response.at_css('a.next-page') request_to :parse, url: absolute_url(next_page['href'], base: url) end end end # Run the spider ProductSpider.crawl! ```

Alternatives / Similar


Was this page helpful?