Skip to content

httrvsralger

MIT 1 9 978
703.8 thousand (month) May 06 2012 1.4.7(9 months ago)
152 1 3 MIT
2.2.4(3 years ago) Dec 22 2019 1.2 thousand (month)

The aim of httr is to provide a wrapper for the curl package, customised to the demands of modern web APIs.

Key features:

  • Functions for the most important http verbs: GET(), HEAD(), PATCH(), PUT(), DELETE() and POST().
  • Automatic connection sharing across requests to the same website (by default, curl handles are managed automatically), cookies are maintained across requests, and a up-to-date root-level SSL certificate store is used.
  • Requests return a standard reponse object that captures the http status line, headers and body, along with other useful information.
  • Response content is available with content() as a raw vector (as = "raw"), a character vector (as = "text"), or parsed into an R object (as = "parsed"), currently for html, xml, json, png and jpeg.
  • You can convert http errors into R errors with stop_for_status().
  • Config functions make it easier to modify the request in common ways: set_cookies(), add_headers(), authenticate(), use_proxy(), verbose(), timeout(), content_type(), accept(), progress().
  • Support for OAuth 1.0 and 2.0 with oauth1.0_token() and oauth2.0_token(). The demo directory has eight OAuth demos: four for 1.0 (twitter, vimeo, withings and yahoo) and four for 2.0 (facebook, github, google, linkedin). OAuth credentials are automatically cached within a project.

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Example Use


library(httr)

# GET requests:
resp <- GET("http://httpbin.org/get")
status_code(resp)  # status code
headers(resp)  # headers
str(content(resp))  # body

# POST requests: 
# Form encoded
resp <- POST(url, body = body, encode = "form")
# Multipart encoded
resp <- POST(url, body = body, encode = "multipart")
# JSON encoded
resp <- POST(url, body = body, encode = "json")

# setting cookies:
resp <- GET("http://httpbin.org/cookies", set_cookies("MeWant" = "cookies"))
content(r)$cookies  # get response cookies
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#>  [1] "Harvard University"
#>  [2] "Stanford University"
#>  [3] "University of Cambridge"
#>  [4] "Massachusetts Institute of Technology (MIT)"
#>  [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
#

# ralger can automatically scrape tables:
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021

Alternatives / Similar