you-get vs trafilatura

you-get: NOASSERTION · 381 · 13 · 56,813 · 27.4 thousand (month) · Sep 01 2012 · 0.4.1743 (2025-01-04 01:51:10)

trafilatura: 5,650 · 4 · 107 · Apache-2.0 · Jul 17 2019 · 5.2 million (month) · 2.0.0 (2024-12-03 15:23:21)

you-get is a command-line utility and a library for downloading multimedia content from various websites, such as YouTube, Instagram, TikTok, and many others. It supports a wide range of video and audio formats, and can be used to download both live streams and on-demand videos. The library is written in Python and can be easily integrated into other Python projects.

Like youtube-dl, you-get contains open-source scrapers for hundreds of websites, which makes it a good codebase for learning about web scraping and popular web scraping techniques.
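One low-risk way to integrate you-get into another Python project is to call its command-line tool and read the result. The sketch below assumes `you-get` is installed and on `PATH`, and that its `--json` option prints the extracted stream information as a single JSON document; the helper names `build_info_command` and `fetch_stream_info` are hypothetical and not part of you-get.

```python
import json
import subprocess


def build_info_command(url: str) -> list[str]:
    """Build a you-get command line that inspects a URL without downloading."""
    # --json asks you-get to print the extracted streams as JSON
    return ["you-get", "--json", url]


def fetch_stream_info(url: str) -> dict:
    """Run you-get and parse its report (assumes stdout is pure JSON)."""
    result = subprocess.run(
        build_info_command(url),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)


# Usage (needs network access and an installed you-get):
# info = fetch_stream_info("https://www.youtube.com/watch?v=jNQXAC9IVRw")
# print(info.get("title"))
```

Keeping the command construction in its own function makes it easy to check the arguments without touching the network.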

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast: it runs in production on millions of documents.
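The precision/recall trade-off can be made concrete with a small, library-independent sketch. It compares an extracted text against a hand-made reference using sets of lowercase tokens; this scoring scheme and the function name `precision_recall` are assumptions for illustration, not the way Trafilatura itself is evaluated.

```python
def precision_recall(extracted: str, reference: str) -> tuple[float, float]:
    """Score an extraction against a reference using unique lowercase tokens."""
    extracted_tokens = set(extracted.lower().split())
    reference_tokens = set(reference.lower().split())
    overlap = extracted_tokens & reference_tokens

    # precision: share of the extracted tokens that belong to the reference
    precision = len(overlap) / len(extracted_tokens) if extracted_tokens else 0.0
    # recall: share of the reference tokens that were actually extracted
    recall = len(overlap) / len(reference_tokens) if reference_tokens else 0.0
    return precision, recall


reference = "the quick brown fox"

# keeps all the content but also navigation noise: full recall, lower precision
noisy = "home about the quick brown fox contact"
# drops the noise but also half the content: full precision, lower recall
strict = "the quick"

print(precision_recall(noisy, reference))
print(precision_recall(strict, reference))
```

An extractor that keeps too much behaves like `noisy`, and one that filters too aggressively behaves like `strict`; the goal is to sit between the two.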

This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.

Example Use

CLI:

```shell
$ you-get 'https://www.youtube.com/watch?v=jNQXAC9IVRw'
site:                YouTube
title:               Me at the zoo
stream:
    - itag:          43
      container:     webm
      quality:       medium
      size:          0.5 MiB (564215 bytes)
    # download-with: you-get --itag=43 [URL]

Downloading Me at the zoo.webm ...
 100% (  0.5/  0.5MB) ├██████████████████████████████████┤[1/1]    6 MB/s

Saving Me at the zoo.en.srt ... Done.
```

Library:

```python
import sys
from you_get import common

# drive you-get's CLI entry point from Python (assumes `common.main` reads sys.argv);
# -O sets the file name and you-get adds the extension, e.g. `output_video.mp4`
sys.argv = ["you-get", "-O", "output_video", "https://www.youtube.com/watch?v=dQw4w9WgXcQ"]
common.main()
```

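```python
# trafilatura reduces an HTML document to its text content
from trafilatura import extract, html2txt

# sample document; the markup is illustrative
html = "<html><head><title>My Title</title></head><body><p>This is some <b>bold</b> text.</p></body></html>"

# extract() returns the main text, or None when it finds nothing substantial
# (a very short input such as this one may be rejected)
print(extract(html))

# html2txt() is the blunt alternative: it returns the text with all tags stripped
print(html2txt(html))

# other output formats keep more structure, e.g. XML
print(extract(html, output_format="xml"))
```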