sax-js vs html5lib
sax-js is a streaming, SAX-style XML parser for Node.js written in pure JavaScript; it does not depend on a C library. It is designed to be fast, low-memory, and easy to use. It is commonly used for parsing large XML files because it emits events as data arrives, so you can process the XML incrementally rather than loading the entire file into memory at once.
sax-js is a low-level, event-based parser: it does not build a document tree itself and does not provide HTML query capabilities (like CSS selectors), though it can be a useful building block for HTML tree parsing and serialization.
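The incremental, event-driven model described above can be sketched in Python with the standard library's `xml.sax` module. This is an analogy for how a SAX-style parser consumes data, not sax-js itself; `TagCounter` and `count_tags` are hypothetical names chosen for illustration.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts elements as start-tag events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag, as soon as the parser has seen it.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(chunks):
    """Feed XML to the parser piece by piece, never joining the chunks."""
    handler = TagCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # chunks may split a tag in the middle
    parser.close()
    return handler.counts


counts = count_tags(["<items><item>a</it", "em><item>b</item>", "</items>"])
print(counts)
```

Only the handler's small dictionary is kept in memory, which is the property that makes this style of parser suitable for very large files.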
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (like parsel, or beautifulsoup when it uses the lxml backend).
However, html5lib follows the HTML5 parsing algorithm more faithfully, so it can represent the HTML tree, particularly for malformed markup, more accurately than these alternatives.
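As a minimal sketch of what this spec-conformant parsing means in practice, the snippet below feeds html5lib a fragment with no `<html>`, `<head>`, or `<body>` and with unclosed `<p>` tags. The sample markup and variable names are assumptions made for illustration; `namespaceHTMLElements=False` is passed only to keep the tag names short.

```python
import html5lib

# Deliberately incomplete markup: no <html>/<head>/<body>, unclosed <p> tags.
broken_html = "<p>One<p>Two"

# The default tree builder returns an ElementTree element for <html>.
root = html5lib.parse(broken_html, namespaceHTMLElements=False)

# The parser inserts the missing structural elements, as a browser would.
children = [child.tag for child in root]

# Each <p> is implicitly closed when the next one opens.
paragraphs = [p.text for p in root.iter("p")]

print(children)
print(paragraphs)
```

The resulting tree contains both `head` and `body`, and the two paragraphs end up as siblings rather than nested elements.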
Example Use
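// sax-js: stream an XML file and log each tag and text node as it is parsed
const fs = require("fs");
const sax = require("sax");

const xmlStream = fs.createReadStream("example.xml");
// `true` enables strict mode, so malformed XML is reported as an error
const saxParser = sax.createStream(true, {});

saxParser.on("opentag", function (node) {
  console.log(`<${node.name}>`);
});
saxParser.on("closetag", function (nodeName) {
  console.log(`</${nodeName}>`);
});
saxParser.on("text", function (text) {
  console.log(text);
});
// Without an "error" listener, a parse error would be thrown by the stream
saxParser.on("error", function (err) {
  console.error("parse error:", err.message);
});

xmlStream.pipe(saxParser);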
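
# html5lib: parse an HTML string into a DOM tree and read the <title> text
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
# treebuilder="dom" returns an xml.dom.minidom document, which provides
# getElementsByTagName; the default "etree" builder returns an ElementTree
# element, which has no such method
parsed = parse(html_doc, treebuilder="dom")
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)  # My Title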