Skip to content

ralgervsayakashi

MIT 3 1 153
1.2 thousand (month) Dec 22 2019 2.2.4(3 years ago)
195 1 8 AGPL-3.0-only
1.0.0-beta8.4(9 months ago) Apr 18 2019 68 (month)

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Ayakashi is a web scraping library for Node.js that allows developers to easily extract structured data from websites. It is built on top of the popular "puppeteer" library and provides a simple and intuitive API for defining and querying the structure of a website.

Features:

  • Powerful querying and data models
    Ayakashi's way of finding things in the page and using them is done with props and domQL. Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props are the way to package domQL expressions as re-usable structures which can then be passed around to actions or to be used as models for data extraction.
  • High level builtin actions
    Ready made actions so you can focus on what matters. Easily handle infinite scrolling, single page navigation, events and more. Plus, you can always build your own actions, either from scratch or by composing other actions.
  • Preload code on pages
    Need to include a bunch of code, a library you made or a 3rd party module and make it available on a page? Preloaders have you covered.

Example Use


library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#>  [1] "Harvard University"
#>  [2] "Stanford University"
#>  [3] "University of Cambridge"
#>  [4] "Massachusetts Institute of Technology (MIT)"
#>  [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
#

# ralger can automatically scrape tables:
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
const ayakashi = require("ayakashi");
const myAyakashi = ayakashi.init();

// navigate the browser
await myAyakashi.goTo("https://example.com/product");

// parsing HTML
// first by defnining a selector
myAyakashi
    .select("productList")
    .where({class: {eq: "product-item"}});

// then executing selector on current HTML:
const productList = await myAyakashi.extract("productList");
console.log(productList);

Alternatives / Similar