extruct vs html2text

python html-extractor

extruct: BSD-3-Clause 56 12 961

extruct: 273.1 thousand (month) Oct 27 2015 0.18.0 (2024-11-08)

html2text: 2,140 8 92 GPL-3.0

html2text: Dec 14 2008 12.6 million (month) 2025.4.15 (2025-04-15)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

- W3C's HTML Microdata
- embedded JSON-LD
- Microformats via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)

extruct is a convenient parser for websites marked up with schema.org metadata, which many modern websites are. It is an easy way to extract commonly needed details such as product information or company contact details.
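To make "embedded JSON-LD" concrete, here is a minimal standard-library sketch of the idea: find `<script type="application/ld+json">` blocks and decode them as JSON. This is not how extruct is implemented. The class name `JsonLdSketch` and the sample HTML are made up for illustration, and real pages need the error handling and extra syntaxes that extruct provides.

```python
import json
from html.parser import HTMLParser


class JsonLdSketch(HTMLParser):
    """Hypothetical helper: collect <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script block opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect all of them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Decode the buffered JSON once the script block closes.
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Made-up sample document with one schema.org Product description.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Hello</p></body></html>
"""

parser = JsonLdSketch()
parser.feed(sample_html)
parser.close()
jsonld_items = parser.items
print(jsonld_items[0]["name"])
```

The decoded items are plain dictionaries, so fields such as `@type` or `name` can be read directly.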

html2text is a Python library that converts HTML into Markdown-structured plain text. It is designed to be easy to use and provides several options for customizing the output.

The package uses Python's built-in html.parser to parse the HTML and then converts it to plain text.
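The general approach of driving text output from html.parser events can be sketched with the standard library alone. The sketch below is a simplified illustration, not html2text's implementation. The names `TextSketch` and `to_text` are hypothetical, and only paragraphs, bold text and links are handled.

```python
from html.parser import HTMLParser


class TextSketch(HTMLParser):
    """Hypothetical helper: turn a tiny subset of HTML into Markdown-style text."""

    def __init__(self, ignore_links=False):
        super().__init__()
        self.ignore_links = ignore_links
        self._parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.ignore_links:
            # Remember the link target so the end tag can emit it.
            self._href = dict(attrs).get("href")
            if self._href:
                self._parts.append("[")
        elif tag in ("strong", "b"):
            self._parts.append("**")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self._parts.append(f"]({self._href})")
            self._href = None
        elif tag in ("strong", "b"):
            self._parts.append("**")
        elif tag == "p":
            # Separate paragraphs with a blank line.
            self._parts.append("\n\n")

    def handle_data(self, data):
        self._parts.append(data)

    def text(self):
        return "".join(self._parts).strip()


def to_text(html, ignore_links=False):
    """Feed the HTML through the sketch parser and return the collected text."""
    parser = TextSketch(ignore_links=ignore_links)
    parser.feed(html)
    parser.close()
    return parser.text()


snippet = "<p>Hello, <a href='https://example.com/'>world</a>!</p>"
print(to_text(snippet))                     # Hello, [world](https://example.com/)!
print(to_text(snippet, ignore_links=True))  # Hello, world!
```

The `ignore_links` flag here only mimics the spirit of the html2text option of the same name. The real library covers many more tags and edge cases.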

html2text also comes with a CLI tool that can convert HTML files to text:

```shell
Usage: html2text [filename [encoding]]

Option             Description
--version          Show program's version number and exit
-h, --help         Show this help message and exit
--ignore-links     Don't include any formatting for links
--escape-all       Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links  Use reference links instead of links to create markdown
--mark-code        Mark preformatted and code blocks with [code]...[/code]
```

Example Use

```python
# retrieve HTML content
import httpx
import extruct

response = httpx.get('https://webscraping.fyi/lib/python/extruct')
# extract every supported metadata format at once;
# base_url expects a string, so convert the httpx URL object
all_data = extruct.extract(response.text, base_url=str(response.url))

# or extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor

extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
```

```python
import html2text

h = html2text.HTML2Text()
# Ignore converting links from HTML
h.ignore_links = True
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Hello, world!

# Don't ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Hello, [world](https://www.google.com/earth/)!
```
