jsdomvsxml2
jsdom is a pure JavaScript implementation of web standards, notably the WHATWG DOM and HTML standards, for use with Node.js. It simulates a browser environment in Node.js, allowing you to parse HTML, manipulate the DOM, and interact with web pages using the same APIs available in web browsers.
Key features for web scraping:
- Full DOM implementation Provides document.querySelector, document.querySelectorAll, and other standard DOM methods for traversing and manipulating parsed HTML.
- Browser-like environment Simulates window, document, navigator, and other browser globals, enabling code that was written for browsers to run in Node.js.
- JavaScript execution Can execute JavaScript embedded in HTML pages, including external scripts, making it possible to process pages that generate content dynamically (though much slower than a real browser).
- Standards-compliant parsing Uses the same HTML parsing algorithm as web browsers (the WHATWG HTML specification), ensuring accurate handling of malformed HTML.
- Cookie support Implements the tough-cookie library for cookie handling across requests.
For web scraping, jsdom is useful when you need more than simple CSS selector matching (what cheerio provides) but don't need a full browser. It's ideal for parsing complex HTML and running simple inline scripts without the overhead of Playwright or Puppeteer. However, for heavy JavaScript-rendered pages, a real browser automation tool is recommended.
The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.
xml2 can be used to parse HTML documents using XPath selectors and is a successor to R's XML package with a few improvements:
- xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
- xml2 has a very simple class hierarchy so don't need to think about exactly what type of object you have, xml2 will just do the right thing.
- More convenient handling of namespaces in Xpath expressions - see xml_ns() and xml_ns_strip() to get started.
Highlights
Example Use
Product A
$10.99Product B
$24.99
</body>
`;
const dom = new JSDOM(html); const document = dom.window.document;
// Use standard DOM APIs to extract data
const products = document.querySelectorAll('.product');
products.forEach(product => {
const name = product.querySelector('h2').textContent;
const price = product.querySelector('.price').textContent;
console.log(${name}: ${price});
});
// Fetch and parse a remote page JSDOM.fromURL('https://example.com').then(dom => { const title = dom.window.document.title; console.log('Page title:', title); }); ```
```r
library("xml2")
x <- read_xml("
xml_name(x) xml_children(x) xml_text(x) xml_find_all(x, ".//baz")
h <- read_html("
Hi !") h xml_name(h) ```