ralger is a small web scraping framework for R based on rvest and xml2.
Its goal is to simplify basic web scraping through a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables automatically,
and extracting links, titles, images, and paragraphs.
Mechanize is a Ruby library for automating interaction with websites. It automatically
stores and sends cookies, follows redirects, and can submit forms, so it behaves
like a web browser without needing an actual browser engine.
Key features include:
- Automatic cookie management: stores cookies received from servers and sends them back
  on subsequent requests, maintaining session state across multiple pages.
- Form handling: can find, fill in, and submit HTML forms programmatically. Supports text
  inputs, selects, checkboxes, radio buttons, and file uploads.
- Link following: navigates through pages by clicking links, which can be located by their
  text content, CSS selectors, or href patterns.
- History and back/forward: maintains a browsing history, allowing you to go back and
  forward through visited pages.
- HTTP authentication: supports basic and digest HTTP authentication.
- Proxy support: can route requests through HTTP proxies.
- Redirect handling: automatically follows HTTP redirects (configurable).
Mechanize is one of the oldest and most established web interaction libraries in Ruby.
It is best suited for scraping traditional server-rendered websites with forms and
multi-page workflows. For JavaScript-heavy sites, a browser automation tool like
Selenium or Playwright is recommended instead.
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```
```ruby
require 'mechanize'
agent = Mechanize.new
# Navigate to a page
page = agent.get('https://example.com')
puts page.title
# Find and click a link
page = page.link_with(text: 'Products').click
# Extract data from the page
page.search('.product').each do |product|
  name = product.at('.name').text
  price = product.at('.price').text
  puts "#{name}: #{price}"
end
# Fill in and submit a login form
login_page = agent.get('https://example.com/login')
form = login_page.form_with(action: '/login')
form['username'] = 'user@example.com'
form['password'] = 'password123'
dashboard = agent.submit(form)
# Cookies are maintained automatically
puts dashboard.title # "Dashboard"
# Download a file
agent.get('https://example.com/report.csv').save('report.csv')
```