xpathvsfeedparser
xpath is a library for Go that allows you to use XPath expressions to select elements from an HTML document. It is built on top of the html package in the Go standard library, and provides a way to select elements from an HTML document using XPath expressions, which are more powerful and expressive than CSS selectors.
feedparser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Appleās iTunes extensions.
To use Universal Feed Parser, you will need Python 3.6 or later. Universal Feed Parser is not meant to run standalone; it is a module for you to use as part of a larger Python program.
feedparser can be used to scrape data feeds as it can download them and parse the XML structured data.
Example Use
package main
import (
"fmt"
"github.com/antchfx/xpath"
"golang.org/x/net/html"
"strings"
)
func main() {
// Create an HTML string
html := `<html>
<body>
<div id="content">
<p>Hello, World!</p>
<a href="http://example.com">Example</a>
</div>
</body>
</html>`
// Parse the HTML string into a node tree
doc, err := html.Parse(strings.NewReader(html))
if err != nil {
fmt.Println("Error:", err)
return
}
// Compile the XPath expression
expr, err := xpath.Compile("//p")
if err != nil {
fmt.Println("Error:", err)
return
}
// Use the Evaluate method to select elements from the document
nodes, err := expr.Evaluate(xpath.NodeNavigator(doc))
if err != nil {
fmt.Println("Error:", err)
return
}
if nodes.MoveNext() {
fmt.Println(nodes.Current().Value())
// > Hello, World!
}
}
import feedparser
# the feed can be loaded from a remote URL
data = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
# local path
data = feedparser.parse('/home/user/data.xml')
# or raw string
data = feedparser.parse('<xml>...</xml>')
# the result dataset is a nested python dictionary containing feed data:
data['feed']['title']