htmlparser2vshtmlquery
htmlparser2 is a Node.js library for parsing HTML and XML documents. It works by building a tree of elements, similar to the Document Object Model (DOM) in web browsers. This allows you to easily traverse and manipulate the structure of the document.
htmlparser2 is a low-level html tree parser but it can still be useful in web scraping as it's a powerful tool for HTML restructuring and serialization.
htmlquery is a Go library that allows you to parse and extract data from HTML documents using XPath expressions. It provides a simple and intuitive API for traversing and querying the HTML tree structure, and it is built on top of the popular Goquery library.
Example Use
```javascript
const htmlparser = require("htmlparser2");
const parser = new htmlparser.Parser({
onopentag: (name, attribs) => {
console.log(`Opening tag: ${name}`);
},
ontext: (text) => {
console.log(`Text: ${text}`);
},
onclosetag: (name) => {
console.log(`Closing tag: ${name}`);
}
}, {decodeEntities: true});
const html = "
Hello, world!
"; parser.write(html); parser.end(); ```
```go
package main
import (
"fmt"
"log"
"github.com/antchfx/htmlquery"
)
func main() {
// Parse the HTML string
doc, err := htmlquery.Parse([]byte(`
elements
lis := htmlquery.Find(doc, "//li")
for _, li := range lis {
fmt.Println(htmlquery.InnerText(li))
}
// "Item 1"
// "Item 2"
// "Item 3"
}
```
Hello, World!
- Item 1
- Item 2
- Item 3