html2text vs newspaper
html2text is a Python library that converts HTML into plain text formatted as Markdown. It is designed to be easy to use and provides several options for customizing the output.
The package uses Python's built-in html.parser to parse the HTML before converting it to plain text.
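To make the parsing approach concrete, here is a minimal sketch of a converter built on html.parser from the standard library. It is not html2text's actual implementation: the class LinkAwareTextExtractor and the helper to_text are hypothetical names chosen for illustration, and only links receive any formatting.

```python
from html.parser import HTMLParser


class LinkAwareTextExtractor(HTMLParser):
    """Toy HTML-to-text converter (hypothetical, not html2text's own code)."""

    def __init__(self, ignore_links=False):
        super().__init__()
        self.ignore_links = ignore_links
        self._parts = []   # output fragments collected so far
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        # Open a Markdown-style link when links are not ignored.
        if tag == "a" and not self.ignore_links:
            self._href = dict(attrs).get("href")
            if self._href:
                self._parts.append("[")

    def handle_endtag(self, tag):
        # Close the link opened in handle_starttag, if there was one.
        if tag == "a" and self._href:
            self._parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Text between tags is copied through unchanged.
        self._parts.append(data)

    def text(self):
        return "".join(self._parts).strip()


def to_text(html, ignore_links=False):
    """Feed the HTML through the parser and return the collected text."""
    parser = LinkAwareTextExtractor(ignore_links=ignore_links)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(to_text("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
```

The parser calls one handler per tag or text run, so a converter of this kind mostly decides what to emit in each handler. A real converter also has to deal with headings, lists, whitespace, and escaping, which this sketch leaves out.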
html2text also comes with a CLI tool that can convert HTML files to text:
Usage: html2text [filename [encoding]]
Option              Description
--version           Show program's version number and exit
-h, --help          Show this help message and exit
--ignore-links      Don't include any formatting for links
--escape-all        Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links   Use reference links instead of inline links to create markdown
--mark-code         Mark preformatted and code blocks with [code]...[/code]
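The same options can be set from Python as attributes on the converter object. The sketch below assumes the attribute names that the CLI flags map to (ignore_links, escape_snob, inline_links, mark_code); check them against the installed html2text version. The wrapper function convert and its keyword names are hypothetical.

```python
def convert(html, ignore_links=False, escape_all=False,
            reference_links=False, mark_code=False):
    """Convert HTML to text, mirroring four of the CLI flags (hypothetical helper)."""
    # Imported lazily so this helper can be defined without html2text installed.
    import html2text

    h = html2text.HTML2Text()
    h.ignore_links = ignore_links         # --ignore-links
    h.escape_snob = escape_all            # --escape-all (assumed attribute name)
    h.inline_links = not reference_links  # --reference-links (assumed attribute name)
    h.mark_code = mark_code               # --mark-code
    return h.handle(html)
```

For example, convert(page_html, ignore_links=True) should correspond to running html2text --ignore-links on the same input, where page_html is any HTML string.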
newspaper is a Python package for extracting text, images, and videos from articles on the web.
It is designed to be fast, easy to use, and compatible with a wide variety of websites. It extracts the main content and metadata (such as title, authors, and publication date) from articles, and it supports several languages.
newspaper includes an HTTP client, or it can ingest pre-scraped HTML documents.
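Ingesting pre-scraped HTML might look like the sketch below. It assumes that Article.download accepts an input_html keyword, as it does in the newspaper3k release; verify this against the installed version. The helper parse_prescraped and the default URL are hypothetical.

```python
def parse_prescraped(html, url="https://www.example.com/article"):
    """Parse HTML that was fetched elsewhere (hypothetical helper)."""
    # Imported lazily so this helper can be defined without newspaper installed.
    from newspaper import Article

    article = Article(url)
    # Hand over the markup instead of letting newspaper fetch the URL itself.
    # input_html is assumed to be supported by the installed version.
    article.download(input_html=html)
    article.parse()
    return article
```

The URL is still passed so that newspaper has a source to associate with the article, but no network request is expected when the HTML is supplied this way.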
Example Use
import html2text
h = html2text.HTML2Text()
# Ignore converting links from HTML
h.ignore_links = True
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Output: Hello, world!
# Don't ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
# Output: Hello, [world](https://www.google.com/earth/)!
from newspaper import Article
# Create a new article object
article = Article('https://www.example.com/article')
# Download the article
article.download()
# Parse the article
article.parse()
# Print the article text
print(article.text)
# Print the article title
print(article.title)
# Print the article authors
print(article.authors)
# Print the article publication date
print(article.publish_date)