soupvshtml5lib
soup is a Go library for parsing and querying HTML documents.
It provides a simple and intuitive interface for extracting information from HTML pages. It's inspired by popular Python web scraping
library BeautifulSoup and shares similar use API implementing functions like Find
and FindAll
.
soup
can also use go's built-in http client to download HTML content.
Note that unlike beautifulsoup, soup
does not support CSS selectors or XPath.
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml
(like parsel
or beautifulsoup
).
However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.
Example Use
package main
import (
"fmt"
"log"
"github.com/anaskhan96/soup"
)
func main() {
url := "https://www.bing.com/search?q=weather+Toronto"
# soup has basic HTTP client though it's not recommended for scraping:
resp, err := soup.Get(url)
if err != nil {
log.Fatal(err)
}
# create soup object from HTML
doc := soup.HTMLParse(resp)
# html elements can be found using Find or FindStrict methods:
# in this case find <div> elements where "class" attribute matches some values:
grid := doc.FindStrict("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
# note: to find all elements FindAll() method can be used the same way
# elements can be further searched for descendents:
heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
conditions := grid.Find("div", "class", "wtr_condition")
primaryCondition := conditions.Find("div")
secondaryCondition := primaryCondition.FindNextElementSibling()
temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
caption := secondaryCondition.Find("div").Text()
fmt.Println("City Name : " + heading)
fmt.Println("Temperature : " + temp + "˚C")
for _, i := range others {
fmt.Println(i.Text())
}
fmt.Println(caption)
}
import html5lib
from html5lib import parse
html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)