htmlquery vs htmlparser2

htmlquery: MIT 8 1 693
58.1 thousand (month) Feb 07 2019 v1.3.0 (1 year, 2 months ago)
htmlparser2: 4,263 4 12 MIT
9.1.0 (2 months ago) Aug 28 2011 127.1 million (month)

htmlquery is a Go library that allows you to parse HTML documents and extract data from them using XPath expressions. It provides a simple API for traversing and querying the HTML tree, and it is built on top of Go's golang.org/x/net/html parser and the antchfx/xpath package (not on Goquery, which is a separate library based on CSS selectors).

htmlparser2 is a Node.js library for parsing HTML and XML documents. At its core it is a streaming, event-driven parser that invokes callbacks as it encounters tags and text. Combined with the domhandler package it depends on, it can also build a tree of elements, similar to the Document Object Model (DOM) in web browsers, which you can then traverse and manipulate.

htmlparser2 is a low-level HTML parser rather than a full scraping toolkit, but it is still useful in web scraping as a capable tool for restructuring and serializing HTML.
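To make the event-driven model concrete for Go readers, here is a minimal sketch that emits the same kind of open/text/close events using only Go's standard library. It is a stand-in for illustration, not htmlparser2 or htmlquery: `encoding/xml` is assumed to be acceptable here only because the sample markup is well-formed XML, which real-world HTML usually is not. The helper name `collectEvents` is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// collectEvents walks the markup token by token and records one line per
// event, similar in spirit to onopentag/ontext/onclosetag callbacks.
// Assumption: the input is well-formed XML; messy HTML needs a real HTML parser.
func collectEvents(markup string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var events []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			// End of input: return everything seen so far.
			return events, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			events = append(events, "Opening tag: "+t.Name.Local)
		case xml.CharData:
			// Convert immediately; the decoder reuses the underlying bytes.
			events = append(events, "Text: "+string(t))
		case xml.EndElement:
			events = append(events, "Closing tag: "+t.Name.Local)
		}
	}
}

func main() {
	events, err := collectEvents("<p>Hello, <b>world</b>!</p>")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```

No tree is built at any point: each event is handled as it arrives and then discarded, which is what keeps memory use low on large documents. A tree-based library such as htmlquery instead parses the whole document first and lets you query it afterwards.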

Example Use


package main

import (
  "fmt"
  "log"
  "strings"

  "github.com/antchfx/htmlquery"
)

func main() {
  // Parse the HTML string; htmlquery.Parse expects an io.Reader, not a byte slice
  doc, err := htmlquery.Parse(strings.NewReader(`
    <html>
      <body>
        <h1>Hello, World!</h1>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </body>
    </html>
  `))
  if err != nil {
    log.Fatal(err)
  }

  // Extract the text of the first <h1> element
  h1 := htmlquery.FindOne(doc, "//h1")
  fmt.Println(htmlquery.InnerText(h1)) // "Hello, World!"

  // Extract the text of all <li> elements
  lis := htmlquery.Find(doc, "//li")
  for _, li := range lis {
    fmt.Println(htmlquery.InnerText(li))
  }
  // "Item 1"
  // "Item 2"
  // "Item 3"
}

// htmlparser2 (Node.js): register event callbacks, then feed HTML to the parser
const htmlparser = require("htmlparser2");
const parser = new htmlparser.Parser({
    onopentag: (name, attribs) => {
        console.log(`Opening tag: ${name}`);
    },
    ontext: (text) => {
        console.log(`Text: ${text}`);
    },
    onclosetag: (name) => {
        console.log(`Closing tag: ${name}`);
    }
}, {decodeEntities: true});

const html = "<p>Hello, <b>world</b>!</p>";
parser.write(html);
parser.end();

Alternatives / Similar