html2textvsreadability
html2text is a Python library that allows developers to convert HTML code into plain text. It is designed to be easy to use, and it provides several options to customize the output.
The package uses the python's built-in html.parser to parse the HTML and then convert it to plain text.
html2text also comes with a CLI tool that can convert HTML files to text:
Usage: html2text [filename [encoding]]
Option Description
--version Show program's version number and exit
-h, --help Show this help message and exit
--ignore-links Don't include any formatting for links
--escape-all Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links Use reference links instead of links to create markdown
--mark-code Mark preformatted and code blocks with [code]...[/code]
python-readability is a python package that allows developers to extract the main content of a web page, removing any unnecessary or unwanted elements, such as ads, navigation, and sidebars.
It is based on the algorithm used by the popular web-based service, Readability, and it uses the beautifulsoup4 package to parse the HTML and extract the main content.
Readability is similar to Newspaper in terms that it's extracting HTML data
Example Use
import html2text
h = html2text.HTML2Text()
# Ignore converting links from HTML
h.ignore_links = True
print h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!")
"Hello, world!"
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
"Hello, world!"
# Don't Ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
"Hello, [world](https://www.google.com/earth/)!"
import requests
from readability import document
response = requests.get('http://example.com')
doc = document(response.content)
doc.title()
'example domain'
doc.summary()
"""<html><body><div><body id="readabilitybody">\n<div>\n <h1>example domain</h1>\n
<p>this domain is established to be used for illustrative examples in documents. you may
use this\n domain in examples without prior coordination or asking for permission.</p>
\n <p><a href="http://www.iana.org/domains/example">more information...</a></p>\n</div>
\n</body>\n</div></body></html>"""