htmlparser2 vs html5lib
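htmlparser2 is a Node.js library for parsing HTML and XML documents. At its core it is a streaming parser that reports opening tags, text, and closing tags through callbacks as it reads the input. Combined with a DOM handler it can also build a tree of elements, similar to the Document Object Model (DOM) in web browsers, which lets you traverse and manipulate the structure of the document.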
htmlparser2 is a low-level HTML parser, but it can still be useful in web scraping because it is a capable tool for restructuring and serializing HTML.
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Because html5lib is implemented in pure Python, it is significantly slower than alternatives backed by lxml (like parsel, or beautifulsoup with its lxml parser).
However, html5lib follows the HTML5 parsing algorithm more faithfully, so it can represent the tree of a real-world document, including a malformed one, more correctly than the alternatives.
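To make the point about spec-style parsing concrete, here is a minimal sketch of how html5lib repairs incomplete markup. It assumes the third-party html5lib package is installed; the `outline` helper and the `XHTML` constant are hypothetical names introduced only for this illustration.

```python
# Sketch only: assumes html5lib is installed (pip install html5lib).
import html5lib

# html5lib places HTML elements in the XHTML namespace by default.
XHTML = "{http://www.w3.org/1999/xhtml}"


def outline(markup: str) -> list:
    """Parse markup with html5lib and return tag names in document order.

    `outline` is a hypothetical helper, not part of html5lib.
    """
    # With no treebuilder argument, parse() returns an xml.etree element.
    root = html5lib.parse(markup)
    # Comment nodes have a non-string tag, so they are skipped here.
    return [el.tag.replace(XHTML, "") for el in root.iter() if isinstance(el.tag, str)]


# Neither <p> is closed, and <html>, <head>, and <body> are missing.
tags = outline("<p>Hello<p>World")
print(tags)  # ['html', 'head', 'body', 'p', 'p']
```

The missing wrapper elements are added and the second `<p>` implicitly closes the first, which matches the tree a browser would build for the same input.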
Example Use
// htmlparser2 (Node.js): callback-based parsing
const htmlparser = require("htmlparser2");

// Each handler fires as the parser reaches that part of the input.
const parser = new htmlparser.Parser({
  onopentag: (name, attribs) => {
    console.log(`Opening tag: ${name}`);
  },
  ontext: (text) => {
    console.log(`Text: ${text}`);
  },
  onclosetag: (name) => {
    console.log(`Closing tag: ${name}`);
  }
}, {decodeEntities: true});

const html = "<p>Hello, <b>world</b>!</p>";
parser.write(html);
parser.end();
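# html5lib (Python)
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
# The default treebuilder returns an xml.etree element, which has no
# getElementsByTagName(); request a minidom document instead.
parsed = parse(html_doc, treebuilder="dom")
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)  # My Title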