
html5lib vs goquery

html5lib: MIT · 97 · 14 · 1,220 · 32.8 million (month) · Jul 30 2007 · 1.1 (2020-06-22 23:32:36)
goquery: 14,926 · 3 · 3 · BSD-3-Clause · Aug 29 2016 · 58.1 thousand (month) · v1.12.0 (2026-03-15 16:28:52)

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (such as parsel or beautifulsoup). However, html5lib follows the HTML5 parsing algorithm more faithfully, so the tree it builds is closer to what a browser would produce than the trees built by those alternatives.

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is a popular, easy-to-use library that lets you select elements from an HTML document using CSS selectors.

It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.

Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.

Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. Because jQuery is so widely used, goquery's author felt it was better for a similar HTML-manipulating library to follow its API than to start anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (looking at you, index()...).
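To make the UTF-8 requirement above concrete, here is a minimal sketch that uses only Go's standard library. It checks whether the input is already valid UTF-8 and otherwise converts it on the assumption that it is ISO-8859-1 (Latin-1). The helper names `latin1ToUTF8` and `ensureUTF8` are hypothetical, and the Latin-1 fallback is an assumption made for illustration. Real documents may use other encodings, in which case a charset-detection package is likely a better fit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string.
// Each Latin-1 byte maps directly to the Unicode code point of the same value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// ensureUTF8 returns the input unchanged when it is already valid UTF-8.
// Otherwise it ASSUMES the bytes are Latin-1 (an illustrative assumption).
func ensureUTF8(b []byte) string {
	if utf8.Valid(b) {
		return string(b)
	}
	return latin1ToUTF8(b)
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xE9} // "café" encoded as ISO-8859-1
	fmt.Println(ensureUTF8(latin1))
}
```

The resulting string can then be wrapped in `strings.NewReader` and handed to goquery.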

goquery can download HTML by itself (using its built-in HTTP client), though this is not recommended for web scraping because such requests are likely to be blocked.
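One alternative to the built-in download is to fetch the page yourself with `net/http`, which gives you control over headers and timeouts, and then pass the body to the parser. The sketch below uses only the standard library. The helper names `newScrapeRequest` and `fetchHTML` and the `example-scraper/1.0` User-Agent string are hypothetical, and a local test server stands in for a real site. Setting a User-Agent does not by itself guarantee that a request will not be blocked.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// newScrapeRequest builds a GET request with an explicit User-Agent header.
// The name and the choice of header are illustrative, not part of goquery.
func newScrapeRequest(url, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

// fetchHTML sends the request and returns the response body as a string.
func fetchHTML(client *http.Client, url, userAgent string) (string, error) {
	req, err := newScrapeRequest(url, userAgent)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// A local test server replaces a real website so the sketch runs offline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "<p>agent: %s</p>", r.Header.Get("User-Agent"))
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 10 * time.Second}
	html, err := fetchHTML(client, srv.URL, "example-scraper/1.0")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	// The HTML could now be passed to goquery.NewDocumentFromReader.
	fmt.Println(html)
}
```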

Example Use


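```python
import html5lib

html_doc = "<html><head><title>My Title</title></head><body></body></html>"

# The default treebuilder returns an ElementTree element, which has no
# getElementsByTagName; request the "dom" treebuilder (xml.dom.minidom) instead.
parsed = html5lib.parse(html_doc, treebuilder="dom")

title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)
```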
```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Load an HTML document from any io.Reader, here an in-memory string.
	// goquery.NewDocument(url) can also load from a URL, but it is deprecated.
	html := `<a href="/one">One</a> <a href="/two">Two</a>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Use the Selection.Find method to select elements from the document
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		// Use the Selection.Text method to get the text of the element
		fmt.Println(s.Text())
		// Selection.Attr returns the attribute value and whether it exists
		if href, ok := s.Attr("href"); ok {
			fmt.Println(href)
		}
	})
}
```
