ralger vs geziyor

MIT 3 1 165
327 (month) Dec 22 2019 2.3.0(2021-03-18 00:10:00 ago)
2,772 1 30 MPL-2.0
Jun 06 2019 2026-04-11(2026-04-11 21:30:25 ago)

ralger is a small web scraping framework for R based on rvest and xml2.

Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.

It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables automatically, and extracting links, titles, images, and paragraphs.

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features:

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8
  • Proxy management (Single, Round-Robin, Custom)
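To make the round-robin proxy feature concrete, here is a minimal standard-library sketch of the idea. It is not Geziyor's implementation: `roundRobinProxy` is a hypothetical helper, and the proxy addresses are placeholders. It only assumes that rotation means handing out the configured proxies in order and wrapping around, using the function signature that `http.Transport.Proxy` expects.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy returns a function with the signature expected by
// http.Transport.Proxy. Each call hands out the next proxy in the list,
// wrapping around at the end. Hypothetical helper, not Geziyor's code.
func roundRobinProxy(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("at least one proxy URL is required")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", raw, err)
		}
		proxies = append(proxies, u)
	}
	var next uint64 // shared counter, updated atomically so concurrent requests are safe
	return func(*http.Request) (*url.URL, error) {
		i := atomic.AddUint64(&next, 1) - 1
		return proxies[i%uint64(len(proxies))], nil
	}, nil
}

func main() {
	// Placeholder proxy addresses, for illustration only.
	pick, err := roundRobinProxy("http://proxy-a.example:8080", "http://proxy-b.example:8080")
	if err != nil {
		panic(err)
	}

	// Requests sent through this client would rotate through the proxies.
	httpClient := &http.Client{Transport: &http.Transport{Proxy: pick}}
	_ = httpClient

	// Show the rotation order without touching the network.
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.Host)
	}
}
```

An atomic counter keeps the picker safe when many requests run concurrently, which matters for a crawler that issues requests in parallel.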

Example Use


```r
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA

# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```
```go
// This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
	"github.com/geziyor/geziyor/export"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		ParseFunc: quotesParse,
		Exporters: []export.Exporter{&export.JSON{}},
	}).Start()
}

// quotesParse exports every quote on the page, then follows the "next" link if there is one.
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
	r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
		g.Exports <- map[string]interface{}{
			"text":   s.Find("span.text").Text(),
			"author": s.Find("small.author").Text(),
		}
	})
	if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
		g.Get(r.JoinURL(href), quotesParse)
	}
}
```
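The pagination step in the Go example depends on turning a relative `href` such as `/page/2/` into an absolute URL before requesting it. The sketch below shows that resolution with only the standard library. `joinURL` is a hypothetical stand-in, and it assumes the behaviour is standard RFC 3986 reference resolution; it is not taken from Geziyor's source.

```go
package main

import (
	"fmt"
	"net/url"
)

// joinURL resolves href against base using RFC 3986 reference resolution,
// the same rules a browser applies to links on a page.
// Hypothetical helper for illustration only.
func joinURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href: %w", err)
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	next, err := joinURL("http://quotes.toscrape.com/", "/page/2/")
	if err != nil {
		panic(err)
	}
	fmt.Println(next) // prints "http://quotes.toscrape.com/page/2/"
}
```

An `href` that is already absolute passes through unchanged, so the same call works for both relative and absolute links.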
