jsdom vs ralger

jsdom: MIT license, 21,552 stars, 263.7 million downloads per month, first release Nov 21 2011, latest version 29.0.2 (2026-04-07)
ralger: MIT license, 165 stars, 327 downloads per month, first release Dec 22 2019, latest version 2.3.0 (2021-03-18)

jsdom is a pure JavaScript implementation of web standards, notably the WHATWG DOM and HTML standards, for use with Node.js. It simulates a browser environment in Node.js, allowing you to parse HTML, manipulate the DOM, and interact with web pages using the same APIs available in web browsers.

Key features for web scraping:

  • Full DOM implementation: Provides document.querySelector, document.querySelectorAll, and other standard DOM methods for traversing and manipulating parsed HTML.
  • Browser-like environment: Simulates window, document, navigator, and other browser globals, enabling code that was written for browsers to run in Node.js.
  • JavaScript execution: Can execute JavaScript embedded in HTML pages, including external scripts, making it possible to process pages that generate content dynamically. Script execution is disabled by default and has to be enabled through the runScripts option; it is also much slower than a real browser.
  • Standards-compliant parsing: Uses the same HTML parsing algorithm as web browsers (the WHATWG HTML specification), ensuring accurate handling of malformed HTML.
  • Cookie support: Uses the tough-cookie library to store and send cookies across requests.

For web scraping, jsdom is useful when you need more than simple CSS selector matching (what cheerio provides) but don't need a full browser. It's ideal for parsing complex HTML and running simple inline scripts without the overhead of Playwright or Puppeteer. However, for heavy JavaScript-rendered pages, a real browser automation tool is recommended.

ralger is a small web scraping framework for R based on rvest and xml2.

Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing, and automatic extraction of links, titles, images, and paragraphs.

Highlights


popular, css-selectors

Example Use


```javascript
const { JSDOM } = require('jsdom');

// Parse an HTML string
// (the tags were lost in extraction; this markup is reconstructed to match the selectors below)
const html = `
<html>
<body>
  <div class="product">
    <h2>Product A</h2>
    <span class="price">$10.99</span>
  </div>
  <div class="product">
    <h2>Product B</h2>
    <span class="price">$24.99</span>
  </div>
</body>
</html>
`;

const dom = new JSDOM(html);
const document = dom.window.document;

// Use standard DOM APIs to extract data
const products = document.querySelectorAll('.product');
products.forEach(product => {
  const name = product.querySelector('h2').textContent;
  const price = product.querySelector('.price').textContent;
  console.log(`${name}: ${price}`);
});

// Fetch and parse a remote page
JSDOM.fromURL('https://example.com').then(dom => {
  const title = dom.window.document.title;
  console.log('Page title:', title);
});
```

```r
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA

# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      Lifetime Gross  Year
#> 1     1 Avatar                                     $2,847,397,339  2009
#> 2     2 Avengers: Endgame                          $2,797,501,328  2019
#> 3     3 Titanic                                    $2,201,647,264  1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700  2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754  2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740  2021
```

