sumy vs extractnet

sumy: Apache-2.0 | 28 | 4 | 3,670 | 152.5 thousand (month) | Oct 20 2013 | 0.12.0 (2026-02-14 21:00:12)
extractnet: MIT | 9 | 9 | 297 | 131 (month) | Dec 11 2020 | 2.0.7 (2022-11-06 07:33:14)

sumy is a Python library for automatic summarization of text documents. It can be used to extract summaries from various input formats such as plaintext, HTML, and URLs. It supports multiple languages and multiple summarization algorithms, including Latent Semantic Analysis (LSA), Luhn, Edmundson, TextRank, and SumBasic.
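To make the idea of extractive summarization concrete, here is a minimal standard-library sketch of frequency-based sentence scoring. It is not sumy's implementation and does not reproduce any of the algorithms listed above exactly. The function names, the tiny stop-word list, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship per-language lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "it", "of", "on", "the", "to", "with"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, sentences_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Count how often each content word occurs across the whole document.
    frequencies = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score (ties keep document order), then restore order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_count])
    return [sentences[i] for i in keep]
```

Calling `summarize(text, sentences_count=3)` returns a list of three sentences taken verbatim from `text`. Summarizers such as LSA or TextRank replace the scoring step with more elaborate models, but the overall shape (split, score, select) is similar.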

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it is not as accurate as structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still recover many details.
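For contrast, the structured approach mentioned above might look like the following sketch. It uses the standard library's `xml.etree.ElementTree`, which only handles well-formed markup and a limited XPath subset, so real scraping code would more likely use a dedicated HTML parser. The sample markup, the class names `phone` and `email`, and the `extract_contact` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (real-world HTML is often not valid XML).
SAMPLE_HTML = """
<html>
  <body>
    <div class="contact">
      <span class="phone">555-555-5555</span>
      <a class="email" href="mailto:example@example.com">example@example.com</a>
    </div>
  </body>
</html>
"""


def extract_contact(markup):
    """Pull fields out with hand-written XPath rules tied to this exact layout."""
    root = ET.fromstring(markup.strip())
    phone = root.find(".//span[@class='phone']")
    email = root.find(".//a[@class='email']")
    return {
        "phone_number": phone.text.strip() if phone is not None and phone.text else None,
        "email": email.text.strip() if email is not None and email.text else None,
    }
```

The rules are precise when the page matches the expected layout, but they return `None` as soon as a class name or tag changes. That brittleness is the trade-off a learned extractor tries to avoid, at some cost in accuracy.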

Example Use


```python
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
```
```python
import requests
from extractnet import Extractor

# Download the page, then let ExtractNet extract fields from the raw HTML.
raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)

# Example result:
# {'phone_number': '555-555-5555', 'email': 'example@example.com'}
```
