Skip to content

soupvshtml5lib

MIT 22 1 2,134
58.1 thousand (month) Apr 29 2017 v1.2.5(2 years ago)
1,096 14 83 MIT License
Jul 30 2007 17.0 million (month) 1.1(3 years ago)

soup is a Go library for parsing and querying HTML documents.

It provides a simple and intuitive interface for extracting information from HTML pages. It's inspired by popular Python web scraping library BeautifulSoup and shares similar use API implementing functions like Find and FindAll.

soup can also use go's built-in http client to download HTML content.

Note that unlike beautifulsoup, soup does not support CSS selectors or XPath.

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml (like parsel or beautifulsoup). However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.

Example Use


package main

import (
  "fmt"
  "log"

  "github.com/anaskhan96/soup"
)

func main() {

  url := "https://www.bing.com/search?q=weather+Toronto"

  # soup has basic HTTP client though it's not recommended for scraping:
  resp, err := soup.Get(url)
  if err != nil {
    log.Fatal(err)
  }

  # create soup object from HTML
  doc := soup.HTMLParse(resp)

  # html elements can be found using Find or FindStrict methods:
  # in this case find <div> elements where "class" attribute matches some values:
  grid := doc.FindStrict("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
  # note: to find all elements FindAll() method can be used the same way

  # elements can be further searched for descendents:
  heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
  conditions := grid.Find("div", "class", "wtr_condition")
  primaryCondition := conditions.Find("div")
  secondaryCondition := primaryCondition.FindNextElementSibling()
  temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
  others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
  caption := secondaryCondition.Find("div").Text()

  fmt.Println("City Name : " + heading)
  fmt.Println("Temperature : " + temp + "˚C")
  for _, i := range others {
    fmt.Println(i.Text())
  }
  fmt.Println(caption)
}
import html5lib
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)

Alternatives / Similar


Was this page helpful?