trafilatura

License: Apache-2.0
Latest version: 2.0.0 (3 Dec 2024) · Downloads: 5.2 million per month

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast: it runs in production on millions of documents.
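To make the idea of dropping recurring elements concrete, here is a toy sketch that uses only the Python standard library. It is not Trafilatura's algorithm, which is considerably more elaborate; the names `NOISE_TAGS`, `MainTextParser` and `main_text`, and the choice of which tags count as noise, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this toy example
# (an assumption for illustration, not Trafilatura's rule set).
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def main_text(html):
    """Return the text found outside of the noise elements."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
  <header>Site name | Home | About</header>
  <article><h1>Headline</h1><p>The actual article text.</p></article>
  <footer>Copyright notice and link list</footer>
</body></html>
"""
print(main_text(page))  # prints: Headline The actual article text.
```

Adding more tags to `NOISE_TAGS` removes more clutter (higher precision) at the risk of discarding valid content (lower recall), which is the trade-off described above.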

This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.

Example Use


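```python
# Trafilatura can be used to pull clean text out of HTML documents.
from trafilatura import extract, html2txt

html = (
    "<html><head><title>My Title</title></head>"
    "<body><p>This is some <b>bold</b> text.</p></body></html>"
)

# Extract the main content as plain text (None if nothing usable is found).
text = extract(html)
print(text)

# Strip away all tags and keep the raw text of the whole document:
print(html2txt(html))

# Or prune selected elements, given as an XPath expression, before extraction:
print(extract(html, prune_xpath="//title"))
```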
