Skip to content

hrequestsvsralger

MIT 22 1 672
10.9 thousand (month) Feb 23 2022 0.8.2(6 months ago)
156 1 3 MIT
Dec 22 2019 289 (month) 2.2.4(3 years ago)

hrequests is a feature rich modern replacement for a famous requests library for Python. It provides a feature rich HTTP client capable of resisting popular scraper identification techniques: - Seamless transition between headless browser and http client based requests - Integrated HTML parser - Mimicking of real browser TLS fingerprints - Javascript rendering - HTTP2 support - Realistic browser headers

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Highlights


bypasshttp2tls-fingerprinthttp-fingerprintsyncasync

Example Use


hrequests has almost identical API and UX as requests and here's a quick overview:
import hrequests

# perform HTTP client requests
resp = hrequests.get('https://httpbin.org/html')
print(resp.status_code)
# 200

# use headless browsers and sessions:
session = hrequests.Session('chrome', version=122, os="mac")

# supports asyncio and easy concurrency
requests = [
    hrequests.async_get('https://www.google.com/', browser='firefox'),
    hrequests.async_get('https://www.duckduckgo.com/'),
    hrequests.async_get('https://www.yahoo.com/'),
    hrequests.async_get('https://www.httpbin.org/'),
]
responses = hrequests.map(requests, size=3)  # max 3 conccurency
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#>  [1] "Harvard University"
#>  [2] "Stanford University"
#>  [3] "University of Cambridge"
#>  [4] "Massachusetts Institute of Technology (MIT)"
#>  [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
#

# ralger can automatically scrape tables:
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021

Alternatives / Similar


Was this page helpful?