html5lib vs htmlquery

html5lib: MIT, 97, 14, 1,220, 32.8 million (month), Jul 30 2007, 1.1 (2020-06-22 23:32:36)
htmlquery: 781, 1, 8, MIT, Feb 07 2019, 58.1 thousand (month), v1.3.6 (2026-03-06 04:46:15)

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Because html5lib is implemented in pure Python, it is significantly slower than alternatives that can use lxml (such as parsel or beautifulsoup). However, html5lib follows the HTML5 parsing algorithm more faithfully, so it can represent the HTML tree more correctly than those alternatives, particularly for malformed markup.
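To make the spec-conformance point concrete, here is a minimal sketch of how html5lib repairs incomplete markup the way a browser would. The input string and variable names are illustrative assumptions; `namespaceHTMLElements=False` is passed only to keep the tag names short.

```python
import html5lib

# Illustrative input: no <html>, <head>, or <body>, and unclosed <p> tags.
broken = "<title>Demo</title><p>One<p>Two"

# The default tree builder returns the root <html> element as an ElementTree element.
doc = html5lib.parse(broken, namespaceHTMLElements=False)

# The parser inserts the missing <head> and <body> elements.
top_level = [child.tag for child in doc]
print(top_level)  # ['head', 'body']

# The second <p> implicitly closes the first, giving two sibling paragraphs.
paragraphs = [p.text for p in doc.iter("p")]
print(paragraphs)  # ['One', 'Two']
```

A lenient but non-conforming parser may build a different tree for the same input, which matters when selectors must match what a browser would render.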

htmlquery is a Go library that allows you to parse and extract data from HTML documents using XPath expressions. It provides a simple and intuitive API for traversing and querying the HTML tree structure, and it is built on top of the golang.org/x/net/html parser and the antchfx/xpath package.

Example Use


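```python
from html5lib import parse

html_doc = "<title>My Title</title>"

# The "dom" tree builder returns an xml.dom.minidom document,
# which provides getElementsByTagName (the default etree builder does not).
parsed = parse(html_doc, treebuilder="dom")

title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)  # My Title
```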
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/antchfx/htmlquery"
)

func main() {
	// Parse the HTML string (htmlquery.Parse expects an io.Reader)
	doc, err := htmlquery.Parse(strings.NewReader(`
<h1>Hello, World!</h1>
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ul>
`))
	if err != nil {
		log.Fatal(err)
	}

	// Extract the text of the first <h1> element
	h1 := htmlquery.FindOne(doc, "//h1")
	fmt.Println(htmlquery.InnerText(h1)) // "Hello, World!"

	// Extract the text of all <li> elements
	lis := htmlquery.Find(doc, "//li")
	for _, li := range lis {
		fmt.Println(htmlquery.InnerText(li))
	}
	// "Item 1"
	// "Item 2"
	// "Item 3"
}
```
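For a side-by-side comparison, the same extraction can be sketched with html5lib. This assumes the default ElementTree tree builder, whose `find` and `findall` methods accept only a limited XPath subset, so it is not a full equivalent of htmlquery's XPath support. The variable names are illustrative.

```python
import html5lib

html_doc = """
<h1>Hello, World!</h1>
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ul>
"""

# Dropping the XHTML namespace keeps the path expressions short.
doc = html5lib.parse(html_doc, namespaceHTMLElements=False)

# Extract the text of the first <h1> element
h1 = doc.find(".//h1")
print(h1.text)  # Hello, World!

# Extract the text of all <li> elements
items = [li.text for li in doc.findall(".//li")]
for item in items:
    print(item)
```

For full XPath 1.0 queries in Python, html5lib can also build an lxml tree via `treebuilder="lxml"`, provided lxml is installed.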