ExtractNet: MIT 11 9 194, 397 (month), Dec 11 2020, 2.0.7 (1 year, 8 months ago)
newspaper: 13,882 6 501 MIT, Dec 28 2012, 415.4 thousand (month), 0.2.8 (5 years ago)

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it is less accurate than structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still extract many details automatically.
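For contrast, here is a minimal sketch of the structured extraction the paragraph above compares against: a hand-written selector per field. The sample markup, its class names, and the `extract_contact` helper are hypothetical. The sketch uses only Python's standard-library ElementTree, which accepts well-formed markup and a limited XPath subset; real pages usually need a lenient HTML parser. It does not show how ExtractNet works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption for illustration only).
SAMPLE_HTML = """
<html>
  <body>
    <div class="contact">
      <span class="phone">555-555-5555</span>
      <a class="email" href="mailto:info@example.com">info@example.com</a>
    </div>
  </body>
</html>
""".strip()


def extract_contact(markup: str) -> dict:
    """Pull two fields out of the markup with explicit XPath-style selectors."""
    root = ET.fromstring(markup)
    # Each field needs its own selector, written against this page's layout.
    phone = root.find(".//span[@class='phone']")
    email = root.find(".//a[@class='email']")
    return {
        "phone_number": phone.text if phone is not None else None,
        "email": email.text if email is not None else None,
    }


contact = extract_contact(SAMPLE_HTML)
print(contact)
```

The selectors are precise but tied to one page layout, which is the trade-off against an automated extractor.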

newspaper is a Python package that allows developers to easily extract text, images, and videos from articles on the web.

It is designed to be fast, easy to use, and compatible with a wide variety of websites. It uses advanced algorithms to extract relevant information and metadata from articles, and it also supports several languages.

newspaper includes an HTTP client, or it can ingest pre-scraped HTML documents.
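The pre-scraped path can be sketched as follows, assuming the newspaper3k release of the package, where `Article.download()` accepts an `input_html` argument. The `parse_prescraped` helper is a hypothetical name introduced here, not part of the library.

```python
def parse_prescraped(html: str, url: str):
    """Parse HTML that was fetched elsewhere, bypassing newspaper's HTTP client."""
    # Imported inside the function so this sketch loads even where
    # newspaper is not installed.
    from newspaper import Article

    article = Article(url)
    # With input_html set, download() uses the given markup instead of
    # requesting the URL (assumption: newspaper3k behavior).
    article.download(input_html=html)
    article.parse()
    return article
```

Called with HTML obtained by any other client, the returned object should expose the usual `title`, `text`, `authors`, and `publish_date` attributes.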

Example Use

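# ExtractNet: extract details from a scraped HTML document
import requests
from extractnet import Extractor

raw_html = requests.get('').text
results = Extractor().extract(raw_html)
print(results)
# {'phone_number': '555-555-5555', 'email': ''}

# newspaper: download and parse an article
from newspaper import Article

# Create a new article object
article = Article('')

# Download the article
article.download()

# Parse the article
article.parse()

# Print the article text
print(article.text)

# Print the article title
print(article.title)

# Print the article authors
print(article.authors)

# Print the article publication date
print(article.publish_date)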
