html5-parser vs xmltodict

html5-parser: Apache-2.0 | 1 | 1 | 683 | 40.3 thousand (month) | Jun 03 2007 | 0.4.12 (1 year, 3 months ago)

xmltodict: 5,577 | 2 | 88 | MIT | Jul 30 2007 | 62.5 million (month) | 0.14.2 (4 months ago)

html5-parser is a Python library for parsing HTML and XML documents.

A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of html5lib's, a speedup of up to 30x. This differs, for instance, from the gumbo Python bindings, where the initial parsing is done in C but the transformation into the final tree is done in Python.

It is built on top of the popular lxml library and provides a simple and intuitive API for working with the document's structure.

html5-parser uses the HTML5 parsing algorithm, which is more lenient and forgiving than the traditional XML-based parsing algorithm. This means that it can parse HTML documents with malformed or missing tags and still produce a usable parse tree.

To use html5-parser, first install it with pip: `pip install html5-parser`. Once it is installed, call the `html5_parser.parse()` function to parse an HTML document and build a parse tree. For example:

from html5_parser import parse

html_string = "<html><body>Hello, World!</body></html>"
root = parse(html_string)
print(root.tag) # html

`html5_parser.parse()` accepts either a unicode string or a bytestring. To parse a file, read its contents first and pass them in.

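Once you have a parse tree, you can use the find() and findall() methods to search for elements in the document, much like BeautifulSoup's find() and find_all(). Because the result is an lxml tree, find() returns the first matching element and findall() returns a list of all matches.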

html5-parser also supports searching using xpath, similar to lxml.

xmltodict is a Python library that allows you to work with XML data as if it were JSON. It allows you to parse XML documents and convert them to dictionaries, which can then be easily manipulated using standard dictionary operations.
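One detail worth knowing when treating XML as a dictionary is how attributes are represented. The sketch below relies on xmltodict's default conventions, where attribute keys are prefixed with `@` and an element's text is stored under `#text` once it also carries attributes. The sample XML is made up for this example.

```python
import xmltodict

# Hypothetical sample: one attribute on the root, one on a child element.
xml_string = '<book id="42"><title lang="en">The Great Gatsby</title></book>'
doc = xmltodict.parse(xml_string)

# Attributes become "@"-prefixed keys alongside child elements.
book_id = doc["book"]["@id"]

# An element with both an attribute and text becomes a nested mapping.
title = doc["book"]["title"]

print(book_id)
print(title["@lang"], title["#text"])
```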

You can also use the library to convert a dictionary back into an XML document. xmltodict parses with the Expat parser from Python's standard library (it does not depend on lxml) and provides a simple, intuitive API for working with XML data.

Note that conversion can be quite slow for large XML documents, so in web scraping xmltodict is better used to parse specific snippets than whole HTML documents.

xmltodict pairs well with JSON parsing tools like jmespath or jsonpath. Alternatively, it can be used in reverse: a dictionary loaded from JSON can be converted to XML with unparse() and then queried with HTML parsing tools like CSS selectors and XPath.
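The reverse direction can be sketched as follows. To keep the example free of extra dependencies, it queries the generated XML with the standard library's ElementTree (which supports a limited XPath subset) rather than parsel or another CSS/XPath tool. The JSON payload is hypothetical, and note that `unparse()` expects a dictionary with a single root key.

```python
import json
import xml.etree.ElementTree as ET

import xmltodict

# Hypothetical JSON payload with exactly one top-level key.
json_string = (
    '{"catalog": {"product": ['
    '{"name": "Tea", "price": 4}, '
    '{"name": "Coffee", "price": 6}]}}'
)
data = json.loads(json_string)

# Lists become repeated elements; scalar values are written as text.
xml_string = xmltodict.unparse(data)

# Encode to bytes so the XML declaration's encoding is honored.
root = ET.fromstring(xml_string.encode("utf-8"))
names = [el.text for el in root.findall("./product/name")]
prices = [el.text for el in root.findall("./product/price")]

print(names)
print(prices)
```

Values come back as strings after the round trip, so numeric fields need converting again if they are used in calculations.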

It can be installed with pip by running `pip install xmltodict`.

Example Use

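from html5_parser import parse

html_string = "<html><body><p>Hello, World!</p></body></html>"
root = parse(html_string)
print(root.tag)  # html
# find() returns the first matching child element
body = root.find("body")
print(body.find("p").text)  # Hello, World!
# or find all: ".//p" searches every descendant, not just direct children
for el in root.findall(".//p"):
    print(el.text)  # Hello, World!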

import xmltodict

xml_string = """
<book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publisher>Charles Scribner's Sons</publisher>
    <publication_date>1925</publication_date>
</book>
"""

# parse the XML into a dictionary
book_dict = xmltodict.parse(xml_string)
print(book_dict)
# {'book': {'title': 'The Great Gatsby',
#           'author': 'F. Scott Fitzgerald',
#           'publisher': "Charles Scribner's Sons",
#           'publication_date': '1925'}}

# and to reverse:
book_xml = xmltodict.unparse(book_dict)
print(book_xml)

# the xml can be loaded and parsed using parsel or beautifulsoup:
from parsel import Selector
sel = Selector(text=book_xml)
print(sel.css('publication_date::text').get())
# 1925

Alternatives / Similar

sax-js: 1,101
parse5: 3,698
htmlparser2: 4,529
beautifulsoup: -
lxml: 2,737
xmltodict: 5,577
cheerio: 28,873
html5lib: 1,153
cssselect: 293
nokogiri: 6,173
feedparser: 2,048
pyquery: 2,312
parsel: 1,187
requests-html: 13,780
xml2: 221
rvest: 1,498
selectolax: 1,186
html5-php: 1,638
untangle: 619
domcrawler: 3,985
cascadia: 717
goquery: 14,273
htmlquery: 760
soup: 2,191
xpath: 699
chompjs: 202
gazpacho: 764
embed: 2,103
chopper: 22
ralger: 156
