Skip to content

sumyvshtml2text

Apache License, Version 2.0 20 2 3,375
65.9 thousand (month) Oct 20 2013 0.11.0(1 year, 5 months ago)
1,595 8 85 GNU GPL 3
2024.2.26(a month ago) Dec 14 2008 1.5 million (month)

sumy is a Python library for automatic summarization of text documents. It can be used to extract summaries from various input formats such as plaintext, HTML, and URLs. It supports multiple languages and multiple summarization algorithms, including Latent Semantic Analysis (LSA), Luhn, Edmundson, TextRank, and SumBasic.

html2text is a Python library that allows developers to convert HTML code into plain text. It is designed to be easy to use, and it provides several options to customize the output.

The package uses the python's built-in html.parser to parse the HTML and then convert it to plain text.

html2text also comes with a CLI tool that can convert HTML files to text:

Usage: html2text [filename [encoding]]

Option  Description
--version   Show program's version number and exit
-h, --help  Show this help message and exit
--ignore-links  Don't include any formatting for links
--escape-all    Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links   Use reference links instead of links to create markdown
--mark-code Mark preformatted and code blocks with [code]...[/code]

Example Use


# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
import html2text

h = html2text.HTML2Text()

# Ignore converting links from HTML
h.ignore_links = True
print h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!")
"Hello, world!"

print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))

"Hello, world!"

# Don't Ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
"Hello, [world](https://www.google.com/earth/)!"

Alternatives / Similar