Skip to content

extractnetvsgofeed

MIT 11 9 263
1.1 thousand (month) Dec 11 2020 2.0.7(2 years ago)
2,641 2 42 MIT
Apr 20 2016 58.1 thousand (month) v1.3.0(1 year, 5 months ago)

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it's not as accurate as structured extraction using HTML parsing tools like CSS selectors or XPath it can still parse a lot of details.

The gofeed library is a robust feed parser that supports parsing both RSS, Atom and JSON feeds. The library provides a universal gofeed.Parser that will parse and convert all feed types into a hybrid gofeed.Feed model.

You also have the option of utilizing the feed specific atom.Parser or rss.Parser or json.Parser parsers which generate atom. Feed , rss.Feed and json.Feed respectively.

Supported feed types:

  • RSS 0.90
  • Netscape RSS 0.91
  • Userland RSS 0.91
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0
  • JSON 1.0
  • JSON 1.1

Example Use


import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)
{'phone_number': '555-555-5555', 'email': 'example@example.com'}
// parse feed from URL
fp := gofeed.NewParser()
fp.UserAgent = "MyCustomAgent 1.0"  // we can modify http client with custom headers etc.
feed, _ := fp.ParseURL("http://feeds.twit.tv/twit.xml")
fmt.Println(feed.Title)

// parse feed from string
feedData := `<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>`
fp := gofeed.NewParser()
feed, _ := fp.ParseString(feedData)
fmt.Println(feed.Title)

// or file
file, _ := os.Open("/path/to/a/file.xml")
defer file.Close()
fp := gofeed.NewParser()
feed, _ := fp.Parse(file)
fmt.Println(feed.Title)

Alternatives / Similar


Was this page helpful?