readabilityvsnewspaper

Apache-2.0 34 5 2,724

270.7 thousand (month) Jun 30 2011 0.8.1(5 years ago)

14,364 6 502 MIT

Dec 28 2012 580.4 thousand (month) 0.2.8(6 years ago)

python-readability is a python package that allows developers to extract the main content of a web page, removing any unnecessary or unwanted elements, such as ads, navigation, and sidebars.

It is based on the algorithm used by the popular web-based service, Readability, and it uses the beautifulsoup4 package to parse the HTML and extract the main content.

Readability is similar to Newspaper in terms that it's extracting HTML data

newspaper is a Python package that allows developers to easily extract text, images, and videos from articles on the web.

It is designed to be fast, easy to use, and compatible with a wide variety of websites. It uses advanced algorithms to extract relevant information and metadata from articles, and it also supports several languages.

newspaper includes a http client or can ingest pre-scraped HTML documents.

Example Use

import requests
from readability import document

response = requests.get('http://example.com')
doc = document(response.content)
doc.title()
'example domain'

doc.summary()
"""<html><body><div><body id="readabilitybody">\n<div>\n    <h1>example domain</h1>\n
<p>this domain is established to be used for illustrative examples in documents. you may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">more information...</a></p>\n</div>
\n</body>\n</div></body></html>"""

from newspaper import Article

# Create a new article object
article = Article('https://www.example.com/article')

# Download the article
article.download()

# Parse the article
article.parse()

# Print the article text
print(article.text)

# Print the article title
print(article.title)

# Print the article authors
print(article.authors)

# Print the article publication date
print(article.publish_date)

readabilityvsnewspaper

Example Use

Alternatives / Similar