Skip to content

newspapervsextractnet

MIT 520 6 13,656
706.6 thousand (month) Dec 28 2012 0.2.8(5 years ago)
157 9 9 MIT
2.0.7(1 year, 4 months ago) Dec 11 2020 216 (month)

newspaper is a Python package that allows developers to easily extract text, images, and videos from articles on the web.

It is designed to be fast, easy to use, and compatible with a wide variety of websites. It uses advanced algorithms to extract relevant information and metadata from articles, and it also supports several languages.

newspaper includes a http client or can ingest pre-scraped HTML documents.

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it's not as accurate as structured extraction using HTML parsing tools like CSS selectors or XPath it can still parse a lot of details.

Example Use


from newspaper import Article

# Create a new article object
article = Article('https://www.example.com/article')

# Download the article
article.download()

# Parse the article
article.parse()

# Print the article text
print(article.text)

# Print the article title
print(article.title)

# Print the article authors
print(article.authors)

# Print the article publication date
print(article.publish_date)
import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)
{'phone_number': '555-555-5555', 'email': 'example@example.com'}

Alternatives / Similar