html5-parservshtmlquery
html5-parser is a Python library for parsing HTML and XML documents.
A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of the html5lib parse times. That is a speedup of 30x. This differs, for instance, from the gumbo python bindings, where the initial parsing is done in C but the transformation into the final tree is done in python.
It is built on top of the popular lxml library and provides a simple and intuitive API for working with the document's structure.
html5-parser uses the HTML5 parsing algorithm, which is more lenient and forgiving than the traditional XML-based parsing algorithm. This means that it can parse HTML documents with malformed or missing tags and still produce a usable parse tree.
To use html5-parser, you first need to install it via pip by running pip install html5-parser.
Once it is installed, you can use the html5_parser.parse() function to parse an HTML document and create a parse tree. For example:
``` from html5_parser import parse
html_string = "
Hello, World!" root = parse(html_string) print(root.tag) # html`
You can also use `html5_parser.parse() with file-like objects, bytes or file paths.
Once you have a parse tree, you can use the find() and findall() methods to search for elements
in the document similar to BeautifulSoup.
html5-parser also supports searching using xpath, similar to lxml.
htmlquery is a Go library that allows you to parse and extract data from HTML documents using XPath expressions. It provides a simple and intuitive API for traversing and querying the HTML tree structure, and it is built on top of the popular Goquery library.
Example Use
Hello, World!
- Item 1
- Item 2
- Item 3