soup vs htmlparser2
soup is a Go library for parsing and querying HTML documents.
It provides a simple and intuitive interface for extracting information from HTML pages. It is inspired by the popular Python web scraping library BeautifulSoup and shares a similar API, implementing functions like Find and FindAll. soup can also use Go's built-in HTTP client to download HTML content.
Note that unlike BeautifulSoup, soup does not support CSS selectors or XPath.
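htmlparser2 is a Node.js library for parsing HTML and XML documents. At its core it is a streaming parser that reports opening tags, text, and closing tags through callbacks. It can also build a tree of elements (for example through its parseDocument function), similar to the Document Object Model (DOM) in web browsers, which lets you traverse and manipulate the structure of the document.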
htmlparser2 is a low-level HTML tree parser, but it can still be useful in web scraping as a tool for restructuring and serializing HTML.
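To make the idea of "building a tree of elements" more concrete, here is a minimal, dependency-free sketch of how open/text/close events could be assembled into a tree with a stack. It is not htmlparser2's actual implementation: `createTreeBuilder` and `serialize` are hypothetical names introduced for illustration, and the events are fed by hand rather than produced by a real parser.

```javascript
// Hypothetical helper (not part of htmlparser2): assembles a tree from
// open/text/close events using a stack of currently open elements.
function createTreeBuilder() {
  const root = { name: "#root", attribs: {}, children: [] };
  const stack = [root]; // the last entry is the innermost open element

  return {
    root,
    onopentag(name, attribs) {
      const node = { name, attribs, children: [] };
      stack[stack.length - 1].children.push(node); // attach to current parent
      stack.push(node); // the new element becomes the current parent
    },
    ontext(text) {
      stack[stack.length - 1].children.push({ text });
    },
    onclosetag() {
      // Step back to the parent, but never pop the synthetic root.
      if (stack.length > 1) stack.pop();
    },
  };
}

// Hypothetical helper: serializes the tree back to markup
// (attributes are omitted for brevity).
function serialize(node) {
  if (node.text !== undefined) return node.text;
  const inner = node.children.map(serialize).join("");
  return node.name === "#root" ? inner : `<${node.name}>${inner}</${node.name}>`;
}

// Events for "<p>Hello, <b>world</b>!</p>", fed by hand.
const builder = createTreeBuilder();
builder.onopentag("p", {});
builder.ontext("Hello, ");
builder.onopentag("b", {});
builder.ontext("world");
builder.onclosetag("b");
builder.ontext("!");
builder.onclosetag("p");

console.log(serialize(builder.root)); // <p>Hello, <b>world</b>!</p>
```

Because the builder uses the same callback names as htmlparser2's handler interface (onopentag, ontext, onclosetag), an object like this could in principle be passed to the library's Parser. For real projects, the library's own DOM handler is the safer choice.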
Example Use
package main

import (
	"fmt"
	"log"

	"github.com/anaskhan96/soup"
)

func main() {
	url := "https://www.bing.com/search?q=weather+Toronto"

	// soup has a basic HTTP client, though it's not recommended for scraping:
	resp, err := soup.Get(url)
	if err != nil {
		log.Fatal(err)
	}

	// create a soup object from the HTML
	doc := soup.HTMLParse(resp)

	// HTML elements can be found using the Find or FindStrict methods;
	// in this case, find the <div> element whose "class" attribute matches the given value:
	grid := doc.FindStrict("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
	if grid.Error != nil {
		// stop early if the page layout is not what we expect
		log.Fatal(grid.Error)
	}
	// note: to find all matching elements, FindAll() can be used the same way

	// elements can be searched further for descendants:
	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
	conditions := grid.Find("div", "class", "wtr_condition")
	primaryCondition := conditions.Find("div")
	secondaryCondition := primaryCondition.FindNextElementSibling()
	temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
	others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
	caption := secondaryCondition.Find("div").Text()

	fmt.Println("City Name : " + heading)
	fmt.Println("Temperature : " + temp + "°C")
	for _, i := range others {
		fmt.Println(i.Text())
	}
	fmt.Println(caption)
}
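The soup example above uses FindStrict for the multi-class `<div>` and Find for the single-class ones. The difference can be sketched without any library, assuming (as I understand soup's behaviour; check the library documentation before relying on it) that Find compares the requested value against each whitespace-separated token of the attribute, while FindStrict compares it against the whole attribute string. The sketch is written in JavaScript to stay dependency-free; `matchesLoose` and `matchesStrict` are hypothetical names, not soup functions.

```javascript
// Loose match (assumed behaviour of soup's Find): the attribute value is
// split on whitespace and any single token may equal the requested value.
function matchesLoose(attrValue, wanted) {
  return attrValue.split(/\s+/).filter(Boolean).includes(wanted);
}

// Strict match (assumed behaviour of soup's FindStrict): the whole
// attribute string must equal the requested value.
function matchesStrict(attrValue, wanted) {
  return attrValue === wanted;
}

const classAttr = "b_antiTopBleed b_antiSideBleed b_antiBottomBleed";

console.log(matchesLoose(classAttr, "b_antiSideBleed"));  // true
console.log(matchesStrict(classAttr, "b_antiSideBleed")); // false
console.log(matchesStrict(classAttr, classAttr));         // true
console.log(matchesLoose(classAttr, classAttr));          // false
```

Under this assumption, a value containing several space-separated class names only matches with the strict variant, which would explain why the example calls FindStrict for the outer grid element.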
const htmlparser = require("htmlparser2");

// register a callback for each kind of parse event
const parser = new htmlparser.Parser({
  onopentag: (name, attribs) => {
    console.log(`Opening tag: ${name}`);
  },
  ontext: (text) => {
    console.log(`Text: ${text}`);
  },
  onclosetag: (name) => {
    console.log(`Closing tag: ${name}`);
  }
}, {decodeEntities: true});

const html = "<p>Hello, <b>world</b>!</p>";
// feed the markup to the parser, then signal the end of input
parser.write(html);
parser.end();