extruct

832 11 53 BSD-3-Clause
0.17.0 (29 May 2024) Oct 27 2015 791.3 thousand (month)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformat via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)

extruct works well on websites marked up with the schema.org vocabulary, which many modern websites use, and it is an easy way to extract commonly needed details such as product information or company contact details.
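To make "embedded metadata" concrete, here is a minimal sketch of what pulling embedded JSON-LD out of a page amounts to, using only the Python standard library. It is an illustration under assumptions, not how extruct is implemented: the `JsonLdScriptParser` class and the `SAMPLE_HTML` snippet are made up for this example, and the sketch ignores malformed JSON, comments inside the script tag, and every format other than JSON-LD.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Hypothetical helper: collect parsed <script type="application/ld+json"> bodies."""

    def __init__(self):
        super().__init__()
        self.items = []          # parsed JSON-LD documents, in page order
        self._in_jsonld = False  # True while inside a JSON-LD script tag
        self._buffer = []        # text chunks of the current script body

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Made-up page containing one schema.org Product described in JSON-LD.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Example Widget</h1></body></html>
"""

parser = JsonLdScriptParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.items[0]["name"])  # prints: Example Widget
```

A library such as extruct saves you from writing and maintaining one such parser per metadata format.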

Example Use


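import extruct
import httpx
from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

# retrieve HTML content
response = httpx.get('https://webscraping.fyi/lib/python/extruct')

# extract every supported metadata format at once;
# base_url is passed as a plain string and is used to resolve relative URLs
all_data = extruct.extract(response.text, base_url=str(response.url))

# or we can extract a specific metadata format by using the individual extractors:
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)

extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)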
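Once the metadata is extracted, the usual next step is picking out the items you care about, for example products or organizations. The sketch below assumes the result of `extruct.extract()` is a dictionary keyed by format name (such as `"json-ld"`), each holding a list of items. The `find_items_by_type` helper and the `sample_data` values are hypothetical and stand in for a real extraction result. It only inspects top-level items, so entries nested under `@graph` would need extra handling.

```python
def find_items_by_type(items, wanted_type):
    """Hypothetical helper: return JSON-LD items whose "@type" matches wanted_type."""
    matches = []
    for item in items:
        item_type = item.get("@type")
        # "@type" may be a single string or a list of strings.
        types = item_type if isinstance(item_type, list) else [item_type]
        if wanted_type in types:
            matches.append(item)
    return matches


# Made-up data shaped like an extraction result; not output from a real page.
sample_data = {
    "json-ld": [
        {
            "@context": "https://schema.org",
            "@type": "Organization",
            "name": "Example Co",
            "telephone": "+1-555-0100",
        },
        {
            "@context": "https://schema.org",
            "@type": ["Product"],
            "name": "Example Widget",
        },
    ],
    "microdata": [],
    "opengraph": [],
}

products = find_items_by_type(sample_data["json-ld"], "Product")
print([p["name"] for p in products])  # prints: ['Example Widget']
```

With a real page, `sample_data` would be replaced by the dictionary returned from the extraction call.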
