ralger is a small web scraping framework for R built on rvest and xml2.
Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, automatic table parsing,
and automatic extraction of links, titles, images, and paragraphs.
Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides
a structured approach to building web scrapers with built-in support for multiple browser
engines, session management, and data pipelines.
Key features include:
- Multiple engine support
  Can use different backends depending on the scraping needs: Mechanize for simple HTTP
  requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, and
  Poltergeist (PhantomJS) for lightweight rendering.
- Scrapy-like architecture
  Follows the spider pattern: define a spider class with start URLs and parsing methods,
  and the framework handles crawling, scheduling, and data collection.
- Built-in data pipelines
  Save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
- Session management
  Maintains browser sessions with automatic cookie handling and configurable delays
  between requests.
- Request scheduling
  Built-in request queue with configurable concurrency, delays, and retry logic.
- CLI tools
  Command-line tools for generating new spiders, running individual spiders, and
  managing scraping projects.
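Session behavior, delays, and retries are controlled through a class-level `@config` hash on the spider. The sketch below follows Kimurai's documented option names, but the spider name, URL, and values are illustrative, not taken from a real project:

```ruby
require 'kimurai'

class ConfiguredSpider < Kimurai::Base
  @name = 'configured_spider'            # hypothetical spider, for illustration
  @engine = :selenium_chrome
  @start_urls = ['https://example.com']

  # Key names follow Kimurai's documented @config options; values are examples
  @config = {
    user_agent: 'Mozilla/5.0 (compatible; ExampleBot/1.0)',
    skip_duplicate_requests: true,             # drop URLs already visited
    retry_request_errors: [Net::ReadTimeout],  # retry requests raising these errors
    before_request: {
      delay: 2..5                              # random delay (seconds) between requests
    }
  }
end
```

Engine, delays, and retry behavior all live in one place on the class, so each spider in a project can tune its own politeness settings independently.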
Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured
scraping projects that need organization, multiple spiders, and data pipeline processing.
Note: Kimurai has not seen active development recently, but it remains useful for Ruby
scraping projects and is included here as the most complete Ruby scraping framework
available.
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#> [1] "navbar-brand logo" "nav-link" NA
#> [4] NA NA "nav-link"
#> [7] NA "nav-link" NA
#> [10] NA
#
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#> Rank Title `Lifetime Gross` Year
#>
#> 1 1 Avatar $2,847,397,339 2009
#> 2 2 Avengers: Endgame $2,797,501,328 2019
#> 3 3 Titanic $2,201,647,264 1997
#> 4 4 Star Wars: Episode VII - The Force Awakens $2,069,521,700 2015
#> 5 5 Avengers: Infinity War $2,048,359,754 2018
#> 6 6 Spider-Man: No Way Home $1,901,216,740 2021
```
```ruby
require 'kimurai'

class ProductSpider < Kimurai::Base
  @name = 'product_spider'
  @engine = :selenium_chrome # or :mechanize for simple pages
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    # Extract product data from the current page
    response.css('.product').each do |product|
      item = {
        name: product.css('.name').text.strip,
        price: product.css('.price').text.strip,
        url: absolute_url(product.at_css('a')['href'], base: url)
      }
      # Send the item to the pipeline
      save_to "products.json", item, format: :json
    end

    # Follow pagination links
    if (next_page = response.at_css('a.next-page'))
      request_to :parse, url: absolute_url(next_page['href'], base: url)
    end
  end
end

# Run the spider
ProductSpider.crawl!
```
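The `save_to` call in the spider can be pictured as an append-to-file pipeline step. Below is a minimal stdlib-only sketch of that idea, not Kimurai's actual implementation: each item is serialized and appended to a shared output file, one JSON object per line (the `append_item` helper and file name are invented for this example):

```ruby
require 'json'

# Illustrative stand-in for an output pipeline step: serialize each
# scraped item and append it to a shared file, one JSON object per line.
def append_item(path, item)
  File.open(path, 'a') { |f| f.puts(JSON.generate(item)) }
end

path = 'products.jsonl'
File.delete(path) if File.exist?(path) # start from a clean file

items = [
  { name: 'Widget', price: '$9.99' },
  { name: 'Gadget', price: '$19.99' }
]
items.each { |item| append_item(path, item) }

puts File.readlines(path).size # => 2
```

Appending one record at a time means partial results survive a crash mid-crawl, which is why scraping frameworks favor this pattern over buffering everything and writing once at the end.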