html5lib vs htmlquery
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement.
Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (such as parsel, or beautifulsoup when it uses its lxml backend).
However, html5lib implements truer HTML5 parsing: it follows the same parsing rules as a browser, including for malformed markup, so it can represent the HTML tree more correctly than the alternatives.
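To make the point about tree correctness concrete, here is a minimal sketch of how html5lib repairs incomplete markup the way the HTML specification describes. The helper name parse_html and the sample markup are illustrative assumptions, not part of html5lib itself.

```python
import html5lib


def parse_html(markup):
    """Parse markup into an xml.etree element rooted at <html> (hypothetical helper)."""
    # namespaceHTMLElements=False keeps tag names plain ("p" rather than
    # "{http://www.w3.org/1999/xhtml}p"), which keeps the paths below short.
    return html5lib.parse(markup, namespaceHTMLElements=False)


# Unclosed <p> tags: the second <p> implicitly closes the first,
# and the missing <html>, <head> and <body> elements are created.
doc = parse_html("<p>one<p>two")
paragraphs = [p.text for p in doc.findall("body/p")]
print(paragraphs)

# A <tr> placed directly inside <table>: the parser inserts the implied <tbody>.
table_doc = parse_html("<table><tr><td>cell</td></tr></table>")
cell = table_doc.find("body/table/tbody/tr/td")
print(cell.text)
```

Both lookups succeed only because the parser has added the elements that the input left out, which is the behaviour a browser would show for the same markup.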
htmlquery is a Go library that lets you parse HTML documents and extract data from them using XPath expressions. It provides a simple API for traversing and querying the HTML tree, and it is built on Go's golang.org/x/net/html parser and the antchfx/xpath package rather than on Goquery.
Example Use
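# html5lib (Python)
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
# The default tree builder returns an xml.etree element, which has no
# getElementsByTagName; request a DOM (xml.dom.minidom) tree instead
parsed = parse(html_doc, treebuilder="dom")
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)  # "My Title"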
// htmlquery (Go)
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/antchfx/htmlquery"
)

func main() {
	// Parse the HTML string; htmlquery.Parse expects an io.Reader
	doc, err := htmlquery.Parse(strings.NewReader(`
		<html>
			<body>
				<h1>Hello, World!</h1>
				<ul>
					<li>Item 1</li>
					<li>Item 2</li>
					<li>Item 3</li>
				</ul>
			</body>
		</html>
	`))
	if err != nil {
		log.Fatal(err)
	}

	// Extract the text of the first <h1> element
	h1 := htmlquery.FindOne(doc, "//h1")
	fmt.Println(htmlquery.InnerText(h1)) // "Hello, World!"

	// Extract the text of all <li> elements
	lis := htmlquery.Find(doc, "//li")
	for _, li := range lis {
		fmt.Println(htmlquery.InnerText(li))
	}
	// "Item 1"
	// "Item 2"
	// "Item 3"
}