Skip to content

ralgervsphpscraper

MIT 3 1 165
327 (month) Dec 22 2019 2.3.0(2021-03-18 00:10:00 ago)
583 2 28 GPL-3.0-or-later
May 04 2020 104 (month) 3.0.0(2024-04-09 15:34:59 ago)

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to page basic features like: Meta data, Links, Images, Headings, Content, Keywords etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.

Example Use


```r library("ralger") url <- "http://www.shanghairanking.com/rankings/arwu/2021" # retrieve HTML and select elements using CSS selectors: best_uni <- scrap(link = url, node = "a span", clean = TRUE) head(best_uni, 5) #> [1] "Harvard University" #> [2] "Stanford University" #> [3] "University of Cambridge" #> [4] "Massachusetts Institute of Technology (MIT)" #> [5] "University of California, Berkeley" # ralger can also parse HTML attributes attributes <- attribute_scrap( link = "https://ropensci.org/", node = "a", # the a tag attr = "class" # getting the class attribute ) head(attributes, 10) # NA values are a tags without a class attribute #> [1] "navbar-brand logo" "nav-link" NA #> [4] NA NA "nav-link" #> [7] NA "nav-link" NA #> [10] NA # # ralger can automatically scrape tables: data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW") head(data) #> # A tibble: 6 × 4 #> Rank Title `Lifetime Gross` Year #> #> 1 1 Avatar $2,847,397,339 2009 #> 2 2 Avengers: Endgame $2,797,501,328 2019 #> 3 3 Titanic $2,201,647,264 1997 #> 4 4 Star Wars: Episode VII - The Force Awakens $2,069,521,700 2015 #> 5 5 Avengers: Infinity War $2,048,359,754 2018 #> 6 6 Spider-Man: No Way Home $1,901,216,740 2021 ```
```javascript // create scraper object $web = new \Spekulatius\PHPScraper\PHPScraper; // go to URL $web->go('https://test-pages.phpscraper.de/content/selectors.html'); // elements can be found using XPath: echo $web->filter("//*[@id='by-id']")->text(); // "Content by ID" // or pre-defined variables covering basic page data: $web->links; // for all links $web->headings; $web->images; $web->contentKeywords; $web->orderedLists; $web->unorderedLists; $web->paragraphs; $web->outline; // basic page outline $web->cleanOutlineWithParagraphs; // basic page outline ```

Alternatives / Similar


Was this page helpful?