html2text vs extractnet

html2text: GPL-3.0 license, 12.6 million downloads per month, latest version 2025.4.15 (released 2025-04-15).

extractnet: MIT license, 131 downloads per month, latest version 2.0.7 (released 2022-11-06).

html2text is a Python library that allows developers to convert HTML code into plain text. It is designed to be easy to use, and it provides several options to customize the output.

The package uses Python's built-in html.parser to parse the HTML and then converts it to plain text.
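To illustrate the parser-driven approach described above, here is a minimal sketch built directly on the standard library's `html.parser`. It is not html2text's actual implementation: the `TextSketch` class and `to_text` helper are hypothetical names invented for this example, and only `<p>` and `<a>` tags are handled.

```python
from html.parser import HTMLParser


class TextSketch(HTMLParser):
    """Toy HTML-to-text converter (hypothetical, not html2text itself)."""

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        # Open a Markdown-style link when an <a href="..."> starts.
        if tag == "a":
            self.href = dict(attrs).get("href")
            if self.href is not None:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            # Close the link with its target.
            self.parts.append(f"]({self.href})")
            self.href = None
        elif tag == "p":
            # Separate paragraphs with a blank line.
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Text between tags is copied through unchanged.
        self.parts.append(data)


def to_text(html):
    parser = TextSketch()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()


converted = to_text("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!</p>")
print(converted)
```

The parser fires a callback for each start tag, end tag, and text run, so the converter only has to decide what to emit for each event. A full converter such as html2text covers far more tags and edge cases than this sketch.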

html2text also comes with a CLI tool that can convert HTML files to text:

```shell
Usage: html2text [filename [encoding]]

Option             Description
--version          Show program's version number and exit
-h, --help         Show this help message and exit
--ignore-links     Don't include any formatting for links
--escape-all       Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links  Use reference links instead of links to create markdown
--mark-code        Mark preformatted and code blocks with [code]...[/code]
```

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it is not as accurate as structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still extract many details.
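For contrast, the structured approach mentioned above can be sketched with the standard library alone. This example assumes a small, well-formed page fragment invented for illustration, and it uses the limited XPath subset supported by `xml.etree.ElementTree`. Real-world HTML is often not well-formed and usually needs a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
page = """
<html>
  <body>
    <div class="contact">
      <span class="phone">555-555-5555</span>
      <a class="email" href="mailto:example@example.com">example@example.com</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)

# Each field needs its own hand-written path expression.
phone = root.find(".//span[@class='phone']").text
email = root.find(".//a[@class='email']").text

details = {"phone_number": phone, "email": email}
print(details)
```

The paths are precise but tied to one page layout, which is the trade-off against an automated extractor: higher accuracy, at the cost of writing and maintaining rules per site.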

Example Use

```python
import html2text

h = html2text.HTML2Text()

# Ignore converting links from HTML
h.ignore_links = True
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Hello, world!

# Don't ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Hello, [world](https://www.google.com/earth/)!
```

```python
import requests
from extractnet import Extractor

# Download the page to extract details from
raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text

# Run the extractor on the raw HTML
results = Extractor().extract(raw_html)

# Illustrative result, as given in the source:
# {'phone_number': '555-555-5555', 'email': 'example@example.com'}
```

Alternatives / Similar