
gofeed vs extractnet

MIT 43 2 2,361
58.1 thousand (month) Apr 20 2016 v1.2.1(11 months ago)
131 9 9 MIT
2.0.7(1 year, 3 months ago) Dec 11 2020 282 (month)

The gofeed library is a robust feed parser that supports RSS, Atom, and JSON feeds. It provides a universal gofeed.Parser that parses any supported feed type and converts it into a hybrid gofeed.Feed model.
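
To illustrate what converting different formats into one hybrid model means, here is a minimal standard-library sketch. It is not gofeed's implementation: the Feed, rssDoc, and atomDoc types and the parseRSS and parseAtom helpers are hypothetical and cover only the title field.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Feed is a hypothetical, heavily simplified stand-in for a hybrid feed model.
type Feed struct {
	Title    string
	FeedType string
}

// rssDoc mirrors only the channel title of an RSS 2.0 document.
type rssDoc struct {
	Title string `xml:"channel>title"`
}

// atomDoc mirrors only the feed title of an Atom document.
type atomDoc struct {
	Title string `xml:"title"`
}

// parseRSS decodes RSS and maps the result onto the shared Feed type.
func parseRSS(data string) (Feed, error) {
	var doc rssDoc
	if err := xml.Unmarshal([]byte(data), &doc); err != nil {
		return Feed{}, err
	}
	return Feed{Title: doc.Title, FeedType: "rss"}, nil
}

// parseAtom decodes Atom and maps the result onto the same Feed type.
func parseAtom(data string) (Feed, error) {
	var doc atomDoc
	if err := xml.Unmarshal([]byte(data), &doc); err != nil {
		return Feed{}, err
	}
	return Feed{Title: doc.Title, FeedType: "atom"}, nil
}

func main() {
	rssData := `<rss version="2.0"><channel><title>Sample RSS</title></channel></rss>`
	atomData := `<feed xmlns="http://www.w3.org/2005/Atom"><title>Sample Atom</title></feed>`

	// Callers handle one Feed type regardless of the source format.
	for _, parse := range []func() (Feed, error){
		func() (Feed, error) { return parseRSS(rssData) },
		func() (Feed, error) { return parseAtom(atomData) },
	} {
		feed, err := parse()
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(feed.FeedType, feed.Title)
	}
}
```

The point of the design is that code consuming feeds depends on a single type, while the format-specific details stay inside each parser.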

You can also use the feed-specific atom.Parser, rss.Parser, or json.Parser, which produce atom.Feed, rss.Feed, and json.Feed respectively.

Supported feed types:

  • RSS 0.90
  • Netscape RSS 0.91
  • Userland RSS 0.91
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0
  • JSON 1.0
  • JSON 1.1
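
Supporting this many formats with one entry point requires deciding which family a document belongs to before parsing it. The following is a minimal standard-library sketch of one way to do that by inspecting the first character and the root element. It is an assumption-laden illustration, not gofeed's own detection code, and detectFeedType is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// detectFeedType guesses the feed family ("rss", "atom", "json", or "unknown").
// Hypothetical helper for illustration only.
func detectFeedType(data string) string {
	trimmed := strings.TrimSpace(data)

	// JSON feeds are JSON objects, so they start with an opening brace.
	if strings.HasPrefix(trimmed, "{") {
		return "json"
	}

	// Otherwise, read XML tokens until the root element appears.
	dec := xml.NewDecoder(strings.NewReader(trimmed))
	for {
		tok, err := dec.Token()
		if err != nil {
			// End of input or malformed XML before any element was seen.
			return "unknown"
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue // skip declarations, comments, and whitespace
		}
		switch strings.ToLower(start.Name.Local) {
		case "rss", "rdf": // RSS 0.90 and 1.0 use an rdf:RDF root element
			return "rss"
		case "feed":
			return "atom"
		default:
			return "unknown"
		}
	}
}

func main() {
	docs := []string{
		`<rss version="2.0"><channel></channel></rss>`,
		`<feed xmlns="http://www.w3.org/2005/Atom"></feed>`,
		`{"version": "https://jsonfeed.org/version/1.1"}`,
	}
	for _, doc := range docs {
		fmt.Println(detectFeedType(doc))
	}
}
```

Distinguishing individual versions (for example RSS 0.91 from RSS 2.0) would additionally require reading the version attribute or namespaces, which this sketch leaves out.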

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it is not as accurate as structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still extract many details.

Example Use


package main

import (
	"fmt"
	"log"
	"os"

	"github.com/mmcdole/gofeed"
)

// check aborts on the first error so feed is never nil when it is used.
func check(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	fp := gofeed.NewParser()
	fp.UserAgent = "MyCustomAgent 1.0" // customize the HTTP request, e.g. the User-Agent header

	// parse feed from URL
	feed, err := fp.ParseURL("http://feeds.twit.tv/twit.xml")
	check(err)
	fmt.Println(feed.Title)

	// parse feed from string (the same parser is reused)
	feedData := `<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>`
	feed, err = fp.ParseString(feedData)
	check(err)
	fmt.Println(feed.Title)

	// or parse feed from a file
	file, err := os.Open("/path/to/a/file.xml")
	check(err)
	defer file.Close()
	feed, err = fp.Parse(file)
	check(err)
	fmt.Println(feed.Title)
}

# ExtractNet (Python): extract details from a fetched HTML page
import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)
print(results)
# e.g. {'phone_number': '555-555-5555', 'email': 'example@example.com'}

Alternatives / Similar