readability vs extractnet

Apache-2.0 37 5 2,894

1.6 million (month) Jun 30 2011 0.8.4.1(2025-05-03 21:11:43 ago)

297 9 9 MIT

Dec 11 2020 131 (month) 2.0.7(2022-11-06 07:33:14 ago)

python-readability is a Python package that extracts the main content of a web page and strips out unwanted elements such as ads, navigation, and sidebars.

It is based on the algorithm used by the popular web-based service Readability, and it uses the lxml package to parse the HTML and extract the main content.
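To give a rough feel for how this kind of main-content extraction can work, here is a toy sketch that scores each top-level `<div>` by how much of its text sits outside links. It is only an illustration of the general idea, not python-readability's actual implementation; the `BlockScorer` class, the `best_block` helper, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class BlockScorer(HTMLParser):
    """Toy heuristic (hypothetical): tally text and link text per top-level <div>."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # one record per top-level <div>
        self._depth = 0     # current <div> nesting depth
        self._in_link = 0   # current <a> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth == 0:
                self.blocks.append({"id": dict(attrs).get("id"), "text": 0, "link": 0})
            self._depth += 1
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
        elif tag == "a" and self._in_link > 0:
            self._in_link -= 1

    def handle_data(self, data):
        if self._depth == 0:
            return  # ignore text outside any <div>
        length = len(data.strip())
        self.blocks[-1]["text"] += length
        if self._in_link:
            self.blocks[-1]["link"] += length


def best_block(html):
    """Return the id of the <div> with the most text that is not inside links."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=lambda b: b["text"] - b["link"])["id"]


# Hypothetical page: a link-heavy menu, an article body, and an ad.
SAMPLE = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact us</a></div>
<div id="article"><p>This paragraph carries the actual story and runs for a while.</p>
<p>A second paragraph with one <a href="/more">link</a> in it.</p></div>
<div id="ads"><a href="#">Buy now</a></div>
"""

main_id = best_block(SAMPLE)
print(main_id)
```

Link-heavy blocks such as menus and ads score low because nearly all of their text is link text, while article-like blocks score high.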

Readability is similar to Newspaper in that it extracts data from HTML.

ExtractNet is an automated web data extraction tool that uses machine learning to parse HTML and text data.

It can be used in web scraping to automatically extract details from scraped HTML documents. While it is not as accurate as structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still pick out a lot of details.
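For contrast, structured extraction with XPath might look like the minimal sketch below. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup; the `PAGE` snippet, the class names, and the `extract_contact` helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <div class="contact">
      <span class="phone">555-555-5555</span>
      <a class="email" href="mailto:example@example.com">example@example.com</a>
    </div>
  </body>
</html>
"""


def extract_contact(markup):
    """Read fields from known positions in the document tree."""
    root = ET.fromstring(markup)
    phone = root.find(".//span[@class='phone']")
    email = root.find(".//a[@class='email']")
    return {
        "phone_number": phone.text if phone is not None else None,
        "email": email.text if email is not None else None,
    }


contact = extract_contact(PAGE)
print(contact)
```

Selectors like these are precise but have to be written per site and break when the markup changes, which is the trade-off against a learned extractor.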

Example Use

```python
import requests
from readability import Document

response = requests.get('http://example.com')
doc = Document(response.content)

doc.title()
# 'Example Domain'

doc.summary()
# Returns the main content as cleaned HTML; for example.com the text reads:
#   Example Domain
#   This domain is established to be used for illustrative examples in documents. You may use this
#   domain in examples without prior coordination or asking for permission.
#   More information...
```

```python
import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)
# {'phone_number': '555-555-5555', 'email': 'example@example.com'}
```


Alternatives / Similar