extruct vs gofeed

python html-extractor

go html-extractor

extruct: BSD-3-Clause 56 12 961, 273.1 thousand (month), Oct 27 2015, 0.18.0 (2024-11-08)

gofeed: 2,824 2 55 MIT, Apr 20 2016, 58.1 thousand (month), v1.3.0 (2024-03-01)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

W3C's HTML Microdata
embedded JSON-LD
Microformat via mf2py
Facebook's Open Graph
(experimental) RDFa via rdflib
Dublin Core Metadata (DC-HTML-2003)

extruct is well suited to websites marked up with schema.org vocabulary, which includes many modern websites, and makes it easy to extract common details such as product information or company contact details.
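To make "embedded JSON-LD" concrete, here is a minimal standard-library sketch of the underlying idea: collect the contents of `<script type="application/ld+json">` elements and decode them as JSON. This is not how extruct is implemented, and it covers only one of the formats listed above. The class name `JsonLdCollector`, the helper `extract_json_ld`, and the sample HTML are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for JSON-LD script elements.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the current block when the script element ends.
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_json_ld(html):
    """Hypothetical helper: return every JSON-LD object found in the HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return [json.loads(block) for block in collector.blocks if block.strip()]


# Illustrative sample page, not taken from any real website.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body></body></html>
"""

items = extract_json_ld(SAMPLE_HTML)
print(items[0]["@type"], items[0]["name"])
```

A real page may mix several metadata formats, contain malformed JSON, or rely on relative URLs, which is where a dedicated library such as extruct is likely to be the better choice.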

The gofeed library is a robust feed parser that supports parsing RSS, Atom and JSON feeds. The library provides a universal gofeed.Parser that will parse and convert all feed types into a hybrid gofeed.Feed model.

You also have the option of using the feed-specific atom.Parser, rss.Parser or json.Parser, which generate atom.Feed, rss.Feed and json.Feed respectively.
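The idea of a universal parser that maps different feed formats onto one shared model can be sketched with the Python standard library. The sketch below is only an illustration of the concept for RSS 2.0 and Atom 1.0; it is not gofeed's implementation, and the function `parse_feed`, the dictionary keys, and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Hypothetical helper: return a format-independent dict for RSS 2.0 or Atom 1.0."""
    root = ET.fromstring(xml_text)

    # RSS 2.0: <rss><channel><title/> <item><title/></item> ...
    if root.tag == "rss":
        channel = root.find("channel")
        if channel is None:
            raise ValueError("RSS document has no <channel> element")
        return {
            "feed_type": "rss",
            "title": channel.findtext("title", default=""),
            "items": [i.findtext("title", default="") for i in channel.findall("item")],
        }

    # Atom 1.0: <feed xmlns="..."><title/> <entry><title/></entry> ...
    if root.tag == ATOM_NS + "feed":
        return {
            "feed_type": "atom",
            "title": root.findtext(ATOM_NS + "title", default=""),
            "items": [e.findtext(ATOM_NS + "title", default="") for e in root.findall(ATOM_NS + "entry")],
        }

    raise ValueError("unsupported feed type: " + root.tag)


# Illustrative sample documents with the same content in two formats.
RSS_SAMPLE = (
    '<rss version="2.0"><channel><title>Sample Feed</title>'
    "<item><title>First post</title></item></channel></rss>"
)
ATOM_SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Sample Feed</title>'
    "<entry><title>First post</title></entry></feed>"
)

for sample in (RSS_SAMPLE, ATOM_SAMPLE):
    feed = parse_feed(sample)
    print(feed["feed_type"], feed["title"], feed["items"])
```

Callers only deal with one shape regardless of the source format, which is the convenience a universal parser offers; format-specific parsers remain useful when fields unique to one format are needed.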

Supported feed types:

RSS 0.90
Netscape RSS 0.91
Userland RSS 0.91
RSS 0.92
RSS 0.93
RSS 0.94
RSS 1.0
RSS 2.0
Atom 0.3
Atom 1.0
JSON 1.0
JSON 1.1

Example Use

```python
# retrieve HTML content
import httpx
response = httpx.get('https://webscraping.fyi/lib/python/extruct')

import extruct
# httpx returns a URL object, so convert it to a string for use as the base URL
all_data = extruct.extract(response.text, base_url=str(response.url))

# or we can extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor

extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
```

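```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/mmcdole/gofeed"
)

func main() {
	// parse feed from URL
	fp := gofeed.NewParser()
	fp.UserAgent = "MyCustomAgent 1.0" // we can modify http client with custom headers etc.
	feed, err := fp.ParseURL("http://feeds.twit.tv/twit.xml")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(feed.Title)

	// parse feed from string (the same parser can be reused)
	feedData := `<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>`
	feed, err = fp.ParseString(feedData)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(feed.Title)

	// or file
	file, err := os.Open("/path/to/a/file.xml")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	feed, err = fp.Parse(file)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(feed.Title)
}
```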
