html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by
However, html5lib implements the HTML5 parsing algorithm more faithfully, so the tree it builds matches what a browser would produce more closely than the trees built by those alternatives.
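As a rough illustration of that browser-like behaviour, the sketch below feeds html5lib a deliberately malformed fragment and inspects the tree it builds. The input string and variable names are made up for this example. It assumes the default etree tree builder and passes namespaceHTMLElements=False only to keep tag names short.

```python
import html5lib

# Hypothetical malformed input: no <html>, <head>, or <body>, and unclosed <p> tags.
broken = "<title>Demo</title><p>first<p>second"

# namespaceHTMLElements=False gives plain tag names such as "p"
# instead of "{http://www.w3.org/1999/xhtml}p".
doc = html5lib.parse(broken, namespaceHTMLElements=False)

# The parser supplies the missing structure, as a browser would.
top_level = [child.tag for child in doc]
print(doc.tag, top_level)

# Each <p> is closed implicitly when the next one opens.
paragraphs = [p.text for p in doc.find("body").findall("p")]
print(paragraphs)
```

The missing html, head, and body elements are created by the parser, and the two paragraphs end up as siblings rather than nested.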
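    from html5lib import parse

    html_doc = "<html><head><title>My Title</title></head><body></body></html>"

    # getElementsByTagName is a DOM method, so request the DOM tree builder
    # (the default tree builder returns an ElementTree element instead).
    parsed = parse(html_doc, treebuilder="dom")

    # getElementsByTagName returns a list of nodes; take the first match
    # and read the text node inside it.
    title = parsed.getElementsByTagName("title")
    print(title[0].childNodes[0].nodeValue)  # My Title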
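The same title lookup can also be sketched against the tree that parse returns when no tree builder is specified, which is an xml.etree.ElementTree element. This is a minimal sketch rather than the only way to do it. The XHTML constant is a helper name introduced here, and it is needed because elements are placed in the XHTML namespace by default.

```python
import html5lib

html_doc = "<html><head><title>My Title</title></head><body></body></html>"

# With no treebuilder argument, parse() builds an ElementTree tree
# and returns its <html> element.
root = html5lib.parse(html_doc)

# Helper constant (not part of html5lib): the namespace prefix
# that the default settings attach to every HTML tag name.
XHTML = "{http://www.w3.org/1999/xhtml}"

title = root.find(f"{XHTML}head/{XHTML}title")
print(root.tag)
print(title.text)
```

Passing namespaceHTMLElements=False to parse drops the namespace prefix, which shortens the lookup to root.find("head/title").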