cheeriovshtml5-parser
cheerio is a popular JavaScript library that allows you to interact with and manipulate HTML and XML documents in a similar way to how you would with jQuery in a browser. It is a fast, flexible, and lean implementation of core jQuery designed specifically for the server.
One of the main benefits of using cheerio is that it allows you to use jQuery-like syntax to navigate and m anipulate the Document Object Model (DOM) of an HTML or XML document, making it easy to work with.
cheerio supports CSS selectors though not XPath.
html5-parser is a Python library for parsing HTML and XML documents.
A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of the html5lib parse times. That is a speedup of 30x. This differs, for instance, from the gumbo python bindings, where the initial parsing is done in C but the transformation into the final tree is done in python.
It is built on top of the popular lxml library and provides a simple and intuitive API for working with the document's structure.
html5-parser uses the HTML5 parsing algorithm, which is more lenient and forgiving than the traditional XML-based parsing algorithm. This means that it can parse HTML documents with malformed or missing tags and still produce a usable parse tree.
To use html5-parser, you first need to install it via pip by running pip install html5-parser
.
Once it is installed, you can use the html5_parser.parse() function to parse an HTML document and create a parse tree. For example:
from html5_parser import parse
html_string = "<html><body>Hello, World!</body></html>"
root = parse(html_string)
print(root.tag) # html
Once you have a parse tree, you can use the find()
and findall()
methods to search for elements
in the document similar to BeautifulSoup.
html5-parser also supports searching using xpath, similar to lxml.
Example Use
const cheerio = require('cheerio');
const $ = cheerio.load('<html><head><title>My title</title></head><body><h1 class='name'>Hello World!</h1></body></html>');
// use css selectors
console.log($('title').text()); // My title
console.log($('.name').text()); // Hello World!
// select multiple elements
const $ = cheerio.load('<html><body><ul><li>item 1</li><li>item 2</li></ul></body></html>');
$('li').each(function(i, elem) {
console.log($(this).text());
});
// modify elements
const $ = cheerio.load('<html><body><h1>Hello World!</h1></body></html>');
$('h1').text('Hello, Cheerio!');
console.log($.html());
from html5_parser import parse
html_string = "<html><body>Hello, World!</body></html>"
root = parse(html_string)
print(root.tag) # html
body = root.find("body")
# or find all
print(body.text) # "Hello, World!"
for el in root.findall("p"):
print(el.text) # "Hello