Skip to content

wombatvskimurai

MIT 24 2 1,360
1.4 thousand (month) Dec 27 2011 3.3.0(2026-04-07 16:31:34 ago)
1,098 1 14 MIT
Aug 23 2018 2.4 thousand (month) 2.2.0(2026-01-27 17:36:19 ago)

Wombat is a Ruby gem that makes it easy to scrape websites and extract structured data from HTML pages. It is built on top of Nokogiri, a popular Ruby gem for parsing and searching HTML and XML documents, and it provides a simple and intuitive API for defining and running web scraping operations.

One of the main features of Wombat is its ability to extract structured data from HTML pages using a simple, CSS-like syntax. It allows you to define a set of rules for extracting data from a page, and then automatically applies those rules to the page's HTML to extract the desired data. This makes it easy to extract data from even complex and dynamic pages, without having to write a lot of custom code.

In addition to its data extraction capabilities, Wombat also provides a variety of other features that can simplify the web scraping process. It can automatically follow links and scrape multiple pages, it can handle pagination and AJAX requests, and it can handle cookies and authentication. It also provides a built-in support for parallelism and queueing to speed up the scraping process.

Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides a structured approach to building web scrapers with built-in support for multiple browser engines, session management, and data pipelines.

Key features include:

  • Multiple engine support Can use different backends depending on the scraping needs: Mechanize for simple HTTP requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, and Poltergeist (PhantomJS) for lightweight rendering.
  • Scrapy-like architecture Follows the spider pattern: define a spider class with start URLs and parsing methods, and the framework handles crawling, scheduling, and data collection.
  • Built-in data pipelines Save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
  • Session management Maintains browser sessions with automatic cookie handling and configurable delays between requests.
  • Request scheduling Built-in request queue with configurable concurrency, delays, and retry logic.
  • CLI tools Command-line tools for generating new spiders, running individual spiders, and managing scraping projects.

Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured scraping projects that need organization, multiple spiders, and data pipeline processing.

Note: Kimurai has not seen active development recently, but it remains a useful framework for Ruby scraping projects and is included as the most complete Ruby scraping framework available.

Highlights


middlewaresoutput-pipelines

Example Use


```ruby require 'wombat' Wombat.crawl do base_url "https://www.github.com" path "/" headline xpath: "//h1" subheading css: "p.alt-lead" what_is({ css: ".one-fourth h4" }, :list) links do explore xpath: '/html/body/header/div/div/nav[1]/a[4]' do |e| e.gsub(/Explore/, "Love") end features css: '.nav-item-opensource' business css: '.nav-item-business' end end ``` will result in: ```json { "headline"=>"How people build software", "subheading"=>"Millions of developers use GitHub to build personal projects, support their businesses, and work together on open source technologies.", "what_is"=>[ "For everything you build", "A better way to work", "Millions of projects", "One platform, from start to finish" ], "links"=>{ "explore"=>"Love", "features"=>"Open source", "business"=>"Business" } } ```
```ruby require 'kimurai' class ProductSpider < Kimurai::Base @name = 'product_spider' @engine = :selenium_chrome # or :mechanize for simple pages @start_urls = ['https://example.com/products'] def parse(response, url:, data: {}) # Extract product data from current page response.css('.product').each do |product| item = { name: product.css('.name').text.strip, price: product.css('.price').text.strip, url: absolute_url(product.at_css('a')['href'], base: url), } # Send item to the pipeline save_to "products.json", item, format: :json end # Follow pagination links if next_page = response.at_css('a.next-page') request_to :parse, url: absolute_url(next_page['href'], base: url) end end end # Run the spider ProductSpider.crawl! ```

Alternatives / Similar


Was this page helpful?