html5lib vs htmlquery
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which is the specification implemented by all major web browsers.
Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (such as parsel, or beautifulsoup when it is used with its lxml parser).
However, html5lib follows the HTML5 parsing algorithm more faithfully, so it can represent the HTML tree more accurately than the alternatives.
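To make the point about tree accuracy concrete, here is a minimal sketch that feeds html5lib deliberately malformed markup. The input string and variable names are made up for illustration; the only library call used is `html5lib.parse` with its `namespaceHTMLElements` option.

```python
import html5lib

# Illustrative, deliberately malformed markup: two unclosed <p> tags
# and a table row that is not wrapped in a <tbody>.
broken_html = (
    "<!DOCTYPE html>"
    "<p>First<p>Second"
    "<table><tr><td>Cell</td></tr></table>"
)

# namespaceHTMLElements=False keeps tag names plain ("p" instead of
# "{http://www.w3.org/1999/xhtml}p"), which keeps the lookups short.
# The default tree builder returns an xml.etree.ElementTree element.
tree = html5lib.parse(broken_html, namespaceHTMLElements=False)

# Collect the text of every <p> element in document order.
paragraphs = [p.text for p in tree.iter("p")]

# Check whether the parser inserted the <tbody> that the markup omitted.
has_tbody = tree.find(".//table/tbody") is not None

print(paragraphs)
print(has_tbody)
```

Under the WHATWG parsing rules, the second `<p>` should implicitly close the first one, and the `<tr>` should be wrapped in an implied `<tbody>`, which is the tree a browser would build from the same input.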
htmlquery is a Go library that allows you to parse HTML documents and extract data from them using XPath expressions. It provides a simple API for traversing and querying the HTML tree, and it is built on top of Go's `golang.org/x/net/html` parser and the `antchfx/xpath` package.
Example Use
```python
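from html5lib import parse

html_doc = "<html><head><title>My Title</title></head></html>"

# The default tree builder returns an ElementTree element, which has no
# getElementsByTagName(); request a minidom document instead.
parsed = parse(html_doc, treebuilder="dom")

title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)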
```
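The same title lookup can also be written against html5lib's default tree builder, which produces `xml.etree.ElementTree` elements rather than a DOM document. This is a small sketch under that assumption; the sample document is illustrative.

```python
import html5lib

html_doc = "<html><head><title>My Title</title></head></html>"

# With no treebuilder argument, html5lib.parse returns the root <html>
# element as an xml.etree.ElementTree element.
# namespaceHTMLElements=False leaves tag names un-namespaced, so the
# plain ".//title" path below matches.
root = html5lib.parse(html_doc, namespaceHTMLElements=False)

# findtext returns the text of the first matching element.
title_text = root.findtext(".//title")
print(title_text)
```

Without `namespaceHTMLElements=False`, element tags carry the XHTML namespace, and the path would need to be written as `.//{http://www.w3.org/1999/xhtml}title`.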
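```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/antchfx/htmlquery"
)

func main() {
	// Parse the HTML string (htmlquery.Parse expects an io.Reader)
	doc, err := htmlquery.Parse(strings.NewReader(`
		<html>
		<body>
			<h1>Hello, World!</h1>
			<ul>
				<li>Item 1</li>
				<li>Item 2</li>
				<li>Item 3</li>
			</ul>
		</body>
		</html>
	`))
	if err != nil {
		log.Fatal(err)
	}

	// Use an XPath expression to select all <li> elements
	lis := htmlquery.Find(doc, "//li")
	for _, li := range lis {
		fmt.Println(htmlquery.InnerText(li))
	}
	// Item 1
	// Item 2
	// Item 3
}
```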