extruct
extruct is a library for extracting embedded metadata from HTML markup.
Currently, extruct supports:
- W3C's HTML Microdata
- embedded JSON-LD
- Microformat via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)
Because many modern websites are marked up with schema.org vocabulary, extruct is an easy way to extract commonly needed details such as product information and company contact details.
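To make "embedded metadata" concrete, the sketch below shows what a JSON-LD block looks like inside a page and how it could be collected using only the Python standard library. This is an illustration of the idea, not how extruct is implemented; the sample HTML and the `JsonLdScriptParser` class are hypothetical names introduced for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page with one embedded JSON-LD block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><h1>Example Widget</h1></body></html>
"""


class JsonLdScriptParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text as JSON once the script tag closes.
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


parser = JsonLdScriptParser()
parser.feed(SAMPLE_HTML)
jsonld_items = parser.items
print(jsonld_items)
```

A library such as extruct handles this format along with the others listed above, and copes with real-world markup that a minimal parser like this one would not.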
Example Use
# retrieve HTML content
import httpx
response = httpx.get('https://webscraping.fyi/lib/python/extruct')
import extruct
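# extract every supported format at once; httpx exposes the URL as an httpx.URL object, so pass base_url as a plain string
all_data = extruct.extract(response.text, base_url=str(response.url))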
# or extract a specific metadata format by importing an individual extractor from its module:
from extruct.w3cmicrodata import MicrodataExtractor
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
from extruct.jsonld import JsonLdExtractor
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)