Skip to content

trafilaturavshtml2text

GPLv3+ 68 3 2,594
251.4 thousand (month) Jul 17 2019 1.7.0(2 months ago)
1,595 8 85 GNU GPL 3
2024.2.26(a month ago) Dec 14 2008 1.5 million (month)

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast, it runs in production on millions of documents.

This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.

html2text is a Python library that allows developers to convert HTML code into plain text. It is designed to be easy to use, and it provides several options to customize the output.

The package uses the python's built-in html.parser to parse the HTML and then convert it to plain text.

html2text also comes with a CLI tool that can convert HTML files to text:

Usage: html2text [filename [encoding]]

Option  Description
--version   Show program's version number and exit
-h, --help  Show this help message and exit
--ignore-links  Don't include any formatting for links
--escape-all    Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links   Use reference links instead of links to create markdown
--mark-code Mark preformatted and code blocks with [code]...[/code]

Example Use


# it can be used to clean HTML files
from trafilatura import clean_html

html = '<html><head><title>My Title</title></head><body><p>This is some <b>bold</b> text.</p></body></html>'
cleaned_html = clean_html(html)
print(cleaned_html)

# can strip away tags:
clean_html(html, tags_to_remove=["title"])
# or attributes
clean_html(html, attributes_to_remove=["title"])
import html2text

h = html2text.HTML2Text()

# Ignore converting links from HTML
h.ignore_links = True
print h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!")
"Hello, world!"

print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))

"Hello, world!"

# Don't Ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
"Hello, [world](https://www.google.com/earth/)!"

Alternatives / Similar