
html2text vs sumy

html2text: GPL-3.0 | 92 | 8 | 2,140 | 12.6 million (month) | Dec 14 2008 | 2025.4.15 (2025-04-15)
sumy: Apache-2.0 | 28 | 4 | 3,670 | 152.5 thousand (month) | Oct 20 2013 | 0.12.0 (2026-02-14)

html2text is a Python library that converts HTML into plain text formatted as Markdown. It is designed to be easy to use and provides several options for customizing the output.

The package uses Python's built-in html.parser module to parse the HTML and then converts the result to plain text.
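To give a feel for the kind of event-based parsing that html.parser provides, here is a minimal sketch that strips tags and keeps only the text. It is not html2text's implementation: the `TextExtractor` class and the `html_to_text` helper are hypothetical names invented for this illustration, and the sketch produces no Markdown formatting at all.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside skipped tags is ignored; everything else is kept.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!</p>"))
# Hello, world!
```

A real converter such as html2text does considerably more on top of this, for example tracking links, lists, and emphasis so they can be rendered as Markdown.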

html2text also comes with a CLI tool that can convert HTML files to text:

```shell
Usage: html2text [filename [encoding]]

Option             Description
--version          Show program's version number and exit
-h, --help         Show this help message and exit
--ignore-links     Don't include any formatting for links
--escape-all       Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links  Use reference links instead of links to create markdown
--mark-code        Mark preformatted and code blocks with [code]...[/code]
```

sumy is a Python library for automatic summarization of text documents. It can be used to extract summaries from various input formats such as plaintext, HTML, and URLs. It supports multiple languages and multiple summarization algorithms, including Latent Semantic Analysis (LSA), Luhn, Edmundson, TextRank, and SumBasic.
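The algorithms listed above are all extractive: they score the sentences of a document and return the highest-scoring ones unchanged. As a rough illustration of that idea only, the sketch below ranks sentences by average word frequency using nothing but the standard library. It is not sumy's code and not a faithful implementation of any of the named algorithms; `STOP_WORDS`, `split_sentences`, `tokenize`, and `summarize` are hypothetical names chosen for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption); real libraries ship larger ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "on", "for"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, sentences_count):
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so length alone does not win.
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Pick the top-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentences_count])
    return [sentences[i] for i in chosen]


text = (
    "Python is popular. "
    "Python libraries make Python text processing easy. "
    "The weather was nice."
)
for sentence in summarize(text, 1):
    print(sentence)
# Python is popular.
```

Methods such as LSA or TextRank replace the simple frequency score with a more elaborate sentence-ranking step, but the overall shape (split, score, select) stays the same.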

Example Use


```python
import html2text

# Sample input; the link target matches the URL shown in the output below
html = "<p>Hello, <a href='https://www.google.com/earth/'>world</a>!</p>"

h = html2text.HTML2Text()

# Ignore converting links from HTML
h.ignore_links = True
print(h.handle(html))
# Hello, world!

# Don't ignore links anymore, I like links
h.ignore_links = False
print(h.handle(html))
# Hello, [world](https://www.google.com/earth/)!
```

```python
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
```


