Skip to content

soupvschopper

MIT 22 1 2,116
58.1 thousand (month) Apr 29 2017 v1.2.5(2 years ago)
22 3 1 MIT
0.6.0(11 months ago) Jul 24 2014 820 (month)

soup is a Go library for parsing and querying HTML documents.

It provides a simple and intuitive interface for extracting information from HTML pages. It's inspired by popular Python web scraping library BeautifulSoup and shares similar use API implementing functions like Find and FindAll.

soup can also use go's built-in http client to download HTML content.

Note that unlike beautifulsoup, soup does not support CSS selectors or XPath.

Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules.

Compared to other HTML parsers Chopper is designed to retain original HTML tree but eliminate elements that do not match parsing rules. Meaning, we can parse HTML elements and keep thei structure for machine learning or other tasks where data structure is needed as well as the data value.

Example Use


package main

import (
  "fmt"
  "log"

  "github.com/anaskhan96/soup"
)

func main() {

  url := "https://www.bing.com/search?q=weather+Toronto"

  # soup has basic HTTP client though it's not recommended for scraping:
  resp, err := soup.Get(url)
  if err != nil {
    log.Fatal(err)
  }

  # create soup object from HTML
  doc := soup.HTMLParse(resp)

  # html elements can be found using Find or FindStrict methods:
  # in this case find <div> elements where "class" attribute matches some values:
  grid := doc.FindStrict("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
  # note: to find all elements FindAll() method can be used the same way

  # elements can be further searched for descendents:
  heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
  conditions := grid.Find("div", "class", "wtr_condition")
  primaryCondition := conditions.Find("div")
  secondaryCondition := primaryCondition.FindNextElementSibling()
  temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
  others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
  caption := secondaryCondition.Find("div").Text()

  fmt.Println("City Name : " + heading)
  fmt.Println("Temperature : " + temp + "˚C")
  for _, i := range others {
    fmt.Println(i.Text())
  }
  fmt.Println(caption)
}
HTML = """
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div id="header"></div>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
        <a href="/nope">Do not want</a>
      </div>
    </div>
    <div id="footer"></div>
  </body>
</html>
"""

CSS = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""

extractor = Extractor.keep('//div[@class="iwantthis"]').discard('//a')
html, css = extractor.extract(HTML, CSS)

# will result in:
html
"""
<html>
  <body>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
      </div>
    </div>
  </body>
</html>"""

css
"""
div{border:1px solid black;}
div#main{color:blue;}
div.iwantthis{background-color:red;}
"""

Alternatives / Similar