Skip to content

Web Scraping FYI

Comparison of python trafilatura vs embera (php) libraries

trafilaturavsembera

Apache-2.0 74 4 3,791

365.7 thousand (month) Jul 17 2019 2.0.0(8 months ago)

346 2 3 MIT

Sep 11 2013 3.0 thousand (month) 2.0.42(7 months ago)

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast, it runs in production on millions of documents.

This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.

Embera is an Oembed consumer library written in PHP. It takes urls from a text and queries the matching service for information about the media and embeds the resulting html. It supports +150 sites, such as Youtube, Twitter, Livestream, Dailymotion, Instagram, Vimeo and many many more.

Example Use

# it can be used to clean HTML files
from trafilatura import clean_html

html = '<html><head><title>My Title</title></head><body><p>This is some <b>bold</b> text.</p></body></html>'
cleaned_html = clean_html(html)
print(cleaned_html)

# can strip away tags:
clean_html(html, tags_to_remove=["title"])
# or attributes
clean_html(html, attributes_to_remove=["title"])

use Embera\Embera;

$embera = new Embera();
print_r($embera->getUrlData([
    'https://vimeo.com/374131624',
    'https://www.flickr.com/photos/bees/8597283706/in/photostream',
]));

// this will generate:
Array
(
    [https://vimeo.com/374131624] => Array
        (
            [type] => video
            [version] => 1.0
            [provider_name] => Vimeo
            [provider_url] => https://vimeo.com/
            [title] => VACATION movie
            [author_name] => Andrey Kasay
            [author_url] => https://vimeo.com/andreykasay
            [is_plus] => 0
            [account_type] => basic
            [html] => <iframe src="......."></iframe>
            [width] => 426
            [height] => 240
            [duration] => 146
            [description] => Остросюжетное кино про жизнь
            [thumbnail_url] => https://i.vimeocdn.com/video/832478725_295x166.jpg
            [thumbnail_width] => 295
            [thumbnail_height] => 166
            [thumbnail_url_with_play_button] => https://i.vimeocdn.com/......Fcrawler_play.png
            [upload_date] => 2019-11-19 06:27:37
            [video_id] => 374131624
            [uri] => /videos/374131624
            [embera_using_fake_response] => 0
            [embera_provider_name] => Vimeo
        )
    [https://www.flickr.com/photos/bees/8597283706/in/photostream] => Array
        (
            [type] => photo
            [flickr_type] => photo
            [title] => Durumu
            [author_name] => ‮‭‬bees‬
            [author_url] => https://www.flickr.com/photos/bees/
            [width] => 1024
            [height] => 723
            [url] => https://live.staticflickr.com/8385/8597283706_7b51ea50b1_b.jpg
            [web_page] => https://www.flickr.com/photos/bees/8597283706/
            [thumbnail_url] => https://live.staticflickr.com/8385/8597283706_7b51ea50b1_q.jpg
            [thumbnail_width] => 150
            [thumbnail_height] => 150
            [web_page_short_url] => https://flic.kr/p/e6HjVq
            [license] => All Rights Reserved
            [license_id] => 0
            [html] => .........
            [version] => 1.0
            [cache_age] => 3600
            [provider_name] => Flickr
            [provider_url] => https://www.flickr.com/
            [embera_using_fake_response] => 0
            [embera_provider_name] => Flickr
            [html_alternative] => ........
        )
)

Alternatives / Similar

html2text

1,897 compare

extruct

884 compare

newspaper

14,364 compare

2,724 compare

134,254 compare

sumy

3,548 compare

gofeed

2,641 compare

you-get

54,934 compare

embed

2,103 compare

embera

346 compare

263 compare

photon

11,149 compare

essence

768 compare

3,791 compare

134,254 compare

you-get

54,934 compare

embed

2,103 compare

essence

768 compare