
chopper vs cascadia

chopper: MIT license, 1.7 thousand (month), first release Jul 24 2014, latest version 0.6.0 (2023-04-26)
cascadia: BSD-2-Clause license, 58.1 thousand (month), first release Feb 20 2018, latest version Start (2018-02-20)

Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules.

Compared to other HTML parsers, Chopper is designed to retain the original HTML tree while eliminating elements that do not match the parsing rules. This means we can extract HTML elements and keep their structure for machine learning or other tasks where the data structure is needed as well as the data values.
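
The idea of pruning a tree while keeping the ancestors of every match can be sketched with the Python standard library alone. This illustrates the concept only and is not Chopper's implementation or API: the `prune` helper and the sample document are hypothetical, and `xml.etree.ElementTree` stands in for a real HTML parser, so the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET


def prune(element, keep):
    """Prune `element` in place; return True if it should survive.

    An element survives when it matches `keep` (its whole subtree is kept)
    or when at least one of its descendants survives (it is an ancestor).
    """
    if keep(element):
        return True
    survived = False
    # Iterate over a copy because children are removed during the loop.
    for child in list(element):
        if prune(child, keep):
            survived = True
        else:
            element.remove(child)
    return survived


# Hypothetical sample document, written without whitespace between tags
# so the serialized result is easy to compare.
DOC = (
    "<html><head><title>Test</title></head><body>"
    "<div id='main'><div class='iwantthis'>HELLO WORLD</div></div>"
    "<div id='footer'/>"
    "</body></html>"
)

root = ET.fromstring(DOC)
prune(root, lambda el: el.get("class") == "iwantthis")
result = ET.tostring(root, encoding="unicode")
print(result)
# <html><body><div id="main"><div class="iwantthis">HELLO WORLD</div></div></body></html>
```

The matched `div` keeps its position under `body` and `div#main`, while `head` and `div#footer` disappear because nothing inside them matched.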

cascadia is a library for Go that provides a CSS selector engine, allowing you to use CSS selectors to select elements from an HTML document.

It is built on top of the golang.org/x/net/html package (a Go sub-repository package, not part of the standard library) and matches compiled CSS selectors against the node tree that package produces.

Example Use


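```python
from chopper.extractor import Extractor

# Input document: only div.iwantthis should survive, without its links
HTML = """
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
        <a>Do not want</a>
      </div>
    </div>
    <div id="footer"></div>
  </body>
</html>
"""

CSS = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""

# Keep elements matching the first XPath, discard those matching the second
extractor = Extractor.keep('//div[@class="iwantthis"]').discard('//a')
html, css = extractor.extract(HTML, CSS)
```

will result in:

```python
# html
"""
<html>
  <body>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
      </div>
    </div>
  </body>
</html>
"""

# css
"""
div{border:1px solid black;}
div#main{color:blue;}
div.iwantthis{background-color:red;}
"""
```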

```go
package main

import (
	"fmt"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func main() {
	// Create an HTML string (named htmlStr so it does not shadow the html package)
	htmlStr := `<html>
  <body>
    <p>Hello, World!</p>
    <p>Example</p>
  </body>
</html>`

	// Parse the HTML string into a node tree
	doc, err := html.Parse(strings.NewReader(htmlStr))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Compile the CSS selector
	sel, err := cascadia.Compile("p")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Selector.MatchAll returns every matching node in the document
	// (Selector.Match only reports whether a single node matches)
	matches := sel.MatchAll(doc)
	if len(matches) > 0 {
		fmt.Println(matches[0].FirstChild.Data)
		// > Hello, World!
	}
}
```
