lxml vs goquery

BSD-3-Clause 14 13 3,010

270.5 million (month) Dec 13 2022 6.0.3(2026-04-09 14:33:38 ago)

14,926 3 3 BSD-3-Clause

Aug 29 2016 58.1 thousand (month) v1.12.0(2026-03-15 16:28:52 ago)

lxml is a low-level XML and HTML tree processor. It's used by many other libraries such as parsel or beautifulsoup for higher level HTML parsing.

One of the main features of lxml is its speed and efficiency.
It is built on top of the libxml2 and libxslt C libraries, which are known for their high performance and low memory footprint. This makes lxml well-suited for processing large and complex XML and HTML documents.
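As a rough illustration of the large-document case, one common approach is incremental parsing with `iterparse`, which lets you process and discard elements as they arrive instead of building the whole tree first. The sketch below is a minimal example under assumptions: the `record` tag name and the `count_records` helper are hypothetical, and the import falls back to the standard library so the snippet still runs where lxml is not installed.

```python
import io

# Prefer lxml; fall back to the standard library, which offers the same call.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def count_records(stream, tag="record"):
    """Count elements named `tag` without keeping their contents in memory."""
    count = 0
    # "end" events fire once an element has been fully parsed.
    for _event, elem in etree.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # discard text and children we no longer need
    return count


# Hypothetical sample input: 1000 small records inside one root element.
sample = io.BytesIO(b"<log>" + b"<record>x</record>" * 1000 + b"</log>")
print(count_records(sample))
```

Whether this is faster than a full parse depends on the document; the main benefit is that memory use grows far more slowly with file size.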

One of the key components of lxml is its ElementTree API, which is modeled after the `xml.etree.ElementTree` module in the Python standard library. This API provides a simple way to access and manipulate the elements and attributes of an XML or HTML document. lxml also exposes an XPath engine (backed by libxml2) that lets you select elements based on their names, attributes, and contents.
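To make the API compatibility concrete, here is a minimal sketch that uses only calls shared by lxml and the standard library's `xml.etree.ElementTree`, so the same code runs on either. The sample catalog and the helper names `summarize` and `name_of` are hypothetical, chosen only for illustration.

```python
# Prefer lxml; fall back to the standard library, whose API lxml mirrors.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document.
XML = """
<catalog>
  <product id="p1"><name>Widget</name><price currency="USD">10</price></product>
  <product id="p2"><name>Gadget</name><price currency="USD">25</price></product>
</catalog>
"""


def summarize(xml_text):
    """Return (id, name, price) tuples, one per <product> element."""
    root = etree.fromstring(xml_text)
    rows = []
    # findall() accepts a limited XPath subset in both implementations.
    for product in root.findall("./product"):
        name = product.find("name").text
        price = float(product.find("price").text)
        rows.append((product.get("id"), name, price))
    return rows


def name_of(xml_text, product_id):
    """Look up a product name using an attribute predicate."""
    root = etree.fromstring(xml_text)
    node = root.find("./product[@id='%s']/name" % product_id)
    return None if node is None else node.text


print(summarize(XML))
print(name_of(XML, "p2"))
```

With lxml installed, elements additionally offer an `.xpath()` method for full XPath 1.0 expressions, which the standard library does not provide.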

Another feature of lxml is its support for transforming XML documents using the XSLT standard. Through libxslt, lxml provides an interface for applying XSLT stylesheets to XML documents, which can be used to convert them into other formats, such as HTML, plain text, or other XML vocabularies.
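A short sketch of an XSLT transformation with lxml follows. It assumes lxml is installed; the stylesheet, the input document, and the `transform` helper are hypothetical examples rather than anything taken from the library's documentation.

```python
from lxml import etree

# Hypothetical stylesheet: turns <items><item>..</item></items> into an HTML list.
XSLT_SOURCE = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="/items">
    <ul><xsl:for-each select="item"><li><xsl:value-of select="."/></li></xsl:for-each></ul>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical input document.
XML_SOURCE = b"<items><item>apple</item><item>pear</item></items>"


def transform(xml_bytes, xslt_bytes):
    """Apply an XSLT stylesheet to an XML document and return the result as text."""
    transformer = etree.XSLT(etree.XML(xslt_bytes))  # compile the stylesheet
    result = transformer(etree.XML(xml_bytes))       # run the transformation
    return str(result)                               # serialize per <xsl:output>


print(transform(XML_SOURCE, XSLT_SOURCE))
```

Compiling the stylesheet once and reusing the resulting `transformer` object is usually the sensible choice when the same stylesheet is applied to many documents.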

For web scraping, it's usually best to use higher-level libraries built on lxml, such as parsel or beautifulsoup.

goquery brings a syntax and a set of features similar to jQuery to the Go language. goquery is a popular and easy-to-use library for Go that allows you to use a CSS selector-like syntax to select elements from an HTML document.

It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.

Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this. Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, I felt that writing a similar HTML-manipulating library was better to follow its API than to start anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (looking at you, index()...).

goquery can download HTML by itself (using Go's built-in HTTP client), though this is not recommended for web scraping because such requests are likely to be blocked.

Highlights

low-level, fast

Example Use

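```python
from lxml import etree

# this is our HTML page (the tags are reconstructed; only the text
# and the span's "price" class are known from the example):
html = """
<html>
<head>
  <title>Hello World!</title>
</head>
<body>
  <h1>Product Title</h1>
  <p>paragraph 1</p>
  <p>paragraph2</p>
  <span class="price">$10</span>
</body>
</html>
"""

# etree.HTML() parses the string with lxml's lenient HTML parser
tree = etree.HTML(html)

# lxml selects elements with XPath
# (CSS selectors require the separate cssselect package):
print(tree.xpath('//span[@class="price"]')[0].text)  # $10
```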

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Load an HTML document from a string; NewDocumentFromReader accepts any io.Reader.
	// (goquery.NewDocument(url) can fetch a URL directly, but it is deprecated
	// in favor of making the request with net/http and passing the response body.)
	html := `<a href="http://example.com">Example</a>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Use the Selection.Find method to select elements from the document
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		// Use the Selection.Text method to get the text of the element
		fmt.Println(s.Text())
		// Selection.Attr returns the attribute value and whether it exists
		if href, ok := s.Attr("href"); ok {
			fmt.Println(href)
		}
	})
}
```