Skip to content

jsdomvshtml5-parser

MIT 412 30 21,552
263.7 million (month) Nov 21 2011 29.0.2(2026-04-07 03:38:38 ago)
700 1 1 Apache-2.0
Jun 03 2007 17.6 thousand (month) 0.4.12(2023-11-19 15:09:54 ago)

jsdom is a pure JavaScript implementation of web standards, notably the WHATWG DOM and HTML standards, for use with Node.js. It simulates a browser environment in Node.js, allowing you to parse HTML, manipulate the DOM, and interact with web pages using the same APIs available in web browsers.

Key features for web scraping:

  • Full DOM implementation Provides document.querySelector, document.querySelectorAll, and other standard DOM methods for traversing and manipulating parsed HTML.
  • Browser-like environment Simulates window, document, navigator, and other browser globals, enabling code that was written for browsers to run in Node.js.
  • JavaScript execution Can execute JavaScript embedded in HTML pages, including external scripts, making it possible to process pages that generate content dynamically (though much slower than a real browser).
  • Standards-compliant parsing Uses the same HTML parsing algorithm as web browsers (the WHATWG HTML specification), ensuring accurate handling of malformed HTML.
  • Cookie support Implements the tough-cookie library for cookie handling across requests.

For web scraping, jsdom is useful when you need more than simple CSS selector matching (what cheerio provides) but don't need a full browser. It's ideal for parsing complex HTML and running simple inline scripts without the overhead of Playwright or Puppeteer. However, for heavy JavaScript-rendered pages, a real browser automation tool is recommended.

html5-parser is a Python library for parsing HTML and XML documents.

A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of the html5lib parse times. That is a speedup of 30x. This differs, for instance, from the gumbo python bindings, where the initial parsing is done in C but the transformation into the final tree is done in python.

It is built on top of the popular lxml library and provides a simple and intuitive API for working with the document's structure.

html5-parser uses the HTML5 parsing algorithm, which is more lenient and forgiving than the traditional XML-based parsing algorithm. This means that it can parse HTML documents with malformed or missing tags and still produce a usable parse tree.

To use html5-parser, you first need to install it via pip by running pip install html5-parser. Once it is installed, you can use the html5_parser.parse() function to parse an HTML document and create a parse tree. For example:

``` from html5_parser import parse

html_string = "Hello, World!" root = parse(html_string) print(root.tag) # html ` You can also use `html5_parser.parse() with file-like objects, bytes or file paths.

Once you have a parse tree, you can use the find() and findall() methods to search for elements in the document similar to BeautifulSoup.

html5-parser also supports searching using xpath, similar to lxml.

Highlights


popularcss-selectors

Example Use


```javascript const { JSDOM } = require('jsdom'); // Parse an HTML string const html = `

Product A

$10.99

Product B

$24.99

</body>

`;

const dom = new JSDOM(html); const document = dom.window.document;

// Use standard DOM APIs to extract data const products = document.querySelectorAll('.product'); products.forEach(product => { const name = product.querySelector('h2').textContent; const price = product.querySelector('.price').textContent; console.log(${name}: ${price}); });

// Fetch and parse a remote page JSDOM.fromURL('https://example.com').then(dom => { const title = dom.window.document.title; console.log('Page title:', title); }); ```

```python from html5_parser import parse

html_string = "Hello, World!" root = parse(html_string) print(root.tag) # html body = root.find("body")

or find all

print(body.text) # "Hello, World!" for el in root.findall("p"): print(el.text) # "Hello ```

Alternatives / Similar


Was this page helpful?