
extruct

961 12 56 BSD-3-Clause
0.18.0 (8 Nov 2024) Oct 27 2015 273.1 thousand (month)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformat via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)

Extruct is a capable parser for websites marked up with schema.org vocabulary (which includes many modern websites), and it is an easy way to extract commonly needed details such as product information and company contact details.
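To make "embedded metadata" concrete, here is a minimal sketch of what a schema.org JSON-LD block looks like and how it could be pulled out by hand with only the standard library. This is not how extruct works internally; the sample HTML, the `JsonLdCollector` class, and the product values are all assumptions made up for illustration, and a real library handles many more edge cases (malformed JSON, multiple syntaxes, relative URLs).

```python
import json
from html.parser import HTMLParser

# Illustrative sample page; the product data is made up for this sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Widget", "offers": {"@type": "Offer", "price": "19.99"}}
</script>
</head><body><h1>Example Widget</h1></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Hypothetical helper: collects parsed <script type="application/ld+json"> contents."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text as JSON once the script tag closes.
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
product = collector.items[0]
print(product["@type"], product["name"], product["offers"]["price"])
```

The point of the sketch is only to show the kind of structured data that sits inside a page; extruct saves you from writing a separate parser like this for each metadata format.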

Example Use


```python
# retrieve HTML content
import httpx

response = httpx.get('https://webscraping.fyi/lib/python/extruct')

# extract every supported metadata format at once
import extruct

# base_url is passed as a plain string so relative URLs can be resolved
all_data = extruct.extract(response.text, base_url=str(response.url))

# or extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor

extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)

extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
```
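Once the data is extracted, the usual next step is to pick out the items you care about. The sketch below assumes the combined result is a dictionary keyed by syntax name (for example `"json-ld"`, `"microdata"`, `"opengraph"`), each holding a list of items. The `sample_data` dictionary is hand-written to mirror that shape rather than taken from a real page, and `find_jsonld_by_type` is a hypothetical helper, not part of extruct.

```python
# Hypothetical result, hand-written to mirror the general shape of a combined
# extraction result: one key per syntax, each holding a list of items.
sample_data = {
    "microdata": [],
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Organization",
         "name": "Example Corp", "telephone": "+1-555-0100"},
        {"@context": "https://schema.org", "@type": ["Product", "Thing"],
         "name": "Example Widget"},
    ],
    "opengraph": [],
}


def find_jsonld_by_type(data, wanted_type):
    """Return JSON-LD items whose @type equals or contains wanted_type."""
    matches = []
    for item in data.get("json-ld", []):
        if not isinstance(item, dict):
            continue  # skip anything that is not a JSON object
        item_type = item.get("@type", [])
        # @type may be a single string or a list of strings
        types = [item_type] if isinstance(item_type, str) else list(item_type)
        if wanted_type in types:
            matches.append(item)
    return matches


products = find_jsonld_by_type(sample_data, "Product")
organizations = find_jsonld_by_type(sample_data, "Organization")
print([p["name"] for p in products])
print([o["telephone"] for o in organizations])
```

Handling both the string and list forms of `@type` matters in practice, because pages are free to declare either.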
