
extruct vs gofeed

BSD-3-Clause 56 12 961
273.1 thousand (month) Oct 27 2015 0.18.0(2024-11-08 14:59:22 ago)
2,824 2 55 MIT
Apr 20 2016 58.1 thousand (month) v1.3.0(2024-03-01 03:34:34 ago)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformat via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)

Extruct is a capable parser for websites marked up with the schema.org vocabulary, which many modern websites use. It is an easy way to extract commonly needed details such as product information and company contact details.
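To make "embedded metadata" more concrete, the following standard-library sketch shows roughly what extracting JSON-LD involves: find each `<script type="application/ld+json">` block and decode its contents. This is not extruct's implementation; the `JsonLdSketch` class, the `extract_jsonld` function and the sample HTML are all made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdSketch(HTMLParser):
    """Collects decoded JSON-LD blocks (hypothetical helper, not part of extruct)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # JSON-LD lives in <script type="application/ld+json"> elements
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # script content may arrive in several chunks, so buffer it
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Return a list of decoded JSON-LD documents found in the HTML string."""
    parser = JsonLdSketch()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page with a single schema.org Product description
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Hello</p></body></html>
"""

if __name__ == "__main__":
    print(extract_jsonld(SAMPLE_HTML))
```

A library such as extruct covers many more cases (other syntaxes, relative URL resolution, malformed markup), which is the main reason to prefer it over a hand-rolled parser like this one.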

The gofeed library is a robust feed parser that supports RSS, Atom and JSON feeds. It provides a universal gofeed.Parser that parses all feed types and converts them into a hybrid gofeed.Feed model.

You can also use the feed-specific atom.Parser, rss.Parser or json.Parser, which generate atom.Feed, rss.Feed and json.Feed respectively.

Supported feed types:

  • RSS 0.90
  • Netscape RSS 0.91
  • Userland RSS 0.91
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0
  • JSON 1.0
  • JSON 1.1
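The idea of a universal parser that maps several feed formats onto one model can be sketched as follows. This is a Python illustration of the concept only: gofeed is a Go library and its internals may well differ. The `Feed` dataclass and `parse_feed` function are hypothetical names, and only the feed title is mapped.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Feed:
    """Tiny stand-in for a unified feed model (hypothetical, title only)."""
    feed_type: str
    title: str


def _local(tag):
    # strip an XML namespace such as "{http://www.w3.org/2005/Atom}feed"
    return tag.rsplit("}", 1)[-1]


def _child_text(parent, name):
    # text of the first direct child with the given local name, or ""
    return next((el.text or "" for el in parent if _local(el.tag) == name), "")


def parse_feed(text):
    """Detect the feed type and convert it into the unified Feed model."""
    text = text.strip()
    if text.startswith("{"):
        # JSON feeds are JSON objects with a top-level "title"
        return Feed("json", json.loads(text).get("title", ""))
    root = ET.fromstring(text)
    name = _local(root.tag)
    if name == "feed":
        # Atom keeps the title directly under the root element
        return Feed("atom", _child_text(root, "title"))
    if name in ("rss", "RDF"):
        # RSS keeps the title inside <channel>
        channel = next((el for el in root if _local(el.tag) == "channel"), None)
        return Feed("rss", _child_text(channel, "title") if channel is not None else "")
    raise ValueError("unrecognised feed type")


if __name__ == "__main__":
    rss = '<rss version="2.0"><channel><title>Sample Feed</title></channel></rss>'
    print(parse_feed(rss))
```

Callers then work with a single `Feed` shape regardless of the source format, which is the convenience a universal parser offers over the format-specific ones.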

Example Use


```python
import extruct
import httpx

# retrieve HTML content
response = httpx.get("https://webscraping.fyi/lib/python/extruct")

# extract all supported metadata formats at once;
# the base URL is passed as a string so relative URLs can be resolved
all_data = extruct.extract(response.text, base_url=str(response.url))

# or extract a specific metadata format by importing an individual extractor:
extractor = extruct.MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = extruct.JsonLdExtractor()
jsonld = extractor.extract(response.text)
```
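Once the metadata is extracted, a common next step is to pick out items of one schema.org type. The sketch below assumes the result is a dictionary keyed by syntax name (such as `"json-ld"`) whose values are lists of decoded items; the `sample_data` dictionary is hand-written to resemble that shape rather than real output, and `jsonld_items_of_type` is a hypothetical helper.

```python
def jsonld_items_of_type(all_data, wanted_type):
    """Return JSON-LD items whose "@type" equals or contains wanted_type."""
    matches = []
    for item in all_data.get("json-ld", []):
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]  # "@type" may be a single string or a list
        if wanted_type in types:
            matches.append(item)
    return matches


# Hand-written sample shaped like an extraction result (an assumption, not real output)
sample_data = {
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"},
        {"@context": "https://schema.org", "@type": ["Organization", "Brand"], "name": "Example Co"},
    ],
    "microdata": [],
    "opengraph": [],
}

if __name__ == "__main__":
    for product in jsonld_items_of_type(sample_data, "Product"):
        print(product["name"])
```

Real pages may nest items (for example under `"@graph"`), so a production version would likely need to walk the structure recursively.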
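```go
package main

import (
	"fmt"
	"os"

	"github.com/mmcdole/gofeed"
)

func main() {
	// errors are ignored for brevity; check them in real code,
	// because feed is nil when parsing fails

	// parse feed from URL
	fp := gofeed.NewParser()
	fp.UserAgent = "MyCustomAgent 1.0" // we can modify the HTTP client with custom headers etc.
	feed, _ := fp.ParseURL("http://feeds.twit.tv/twit.xml")
	fmt.Println(feed.Title)

	// parse feed from string (illustrative minimal RSS document)
	feedData := `<rss version="2.0"><channel><title>Sample Feed</title></channel></rss>`
	feed, _ = fp.ParseString(feedData)
	fmt.Println(feed.Title)

	// or from a file, reusing the same parser
	file, _ := os.Open("/path/to/a/file.xml")
	defer file.Close()
	feed, _ = fp.Parse(file)
	fmt.Println(feed.Title)
}
```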


