html5lib vs embed

MIT 97 14 1,220

32.8 million (month) Jul 30 2007 1.1 (2020-06-22 23:32:36)

2,103 6 71 MIT

Oct 26 2013 5.2 thousand (month) v4.4.15 (2025-01-02 16:53:09)

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (such as parsel or beautifulsoup). However, html5lib applies the same HTML5 parsing rules as browsers, including their error recovery, so it can represent the HTML tree more correctly than those alternatives.
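To make the point about tree correctness concrete, here is a minimal sketch, assuming html5lib is installed. The malformed input string and the variable names are illustrative only. It parses a fragment with no `<html>` or `<body>` tags and an unclosed `<p>`, then inspects the tree that html5lib builds.

```python
import html5lib

# Deliberately malformed input (illustrative): no <html>/<head>/<body> tags,
# and the first <p> is never closed.
broken_html = "<title>Demo</title><p>first<p>second"

# namespaceHTMLElements=False keeps tag names plain ("p" instead of
# "{http://www.w3.org/1999/xhtml}p"), which keeps the ElementTree queries short.
root = html5lib.parse(broken_html, treebuilder="etree", namespaceHTMLElements=False)

# The parser supplies the missing <html>, <head> and <body> elements.
top_level_tags = [child.tag for child in root]

# The second <p> implicitly closes the first, so they end up as siblings.
paragraphs = [p.text for p in root.findall("./body/p")]

print(root.tag)
print(top_level_tags)
print(paragraphs)
```

The two paragraphs come out as siblings under an inferred `<body>`, which is the structure a browser would build for the same input.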

embed is a PHP library for getting information from any web page (using oEmbed, Open Graph, Twitter Cards, scraping the HTML, etc.). It is compatible with any web service (YouTube, Vimeo, Flickr, Instagram, etc.) and has adapters for some sites (archive.org, GitHub, Facebook, etc.).

Example Use

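```python
from html5lib import parse

# A small document with a <title> element to look up.
html_doc = "<html><head><title>My Title</title></head><body></body></html>"

# The default "etree" tree builder returns an ElementTree element, which has
# no getElementsByTagName(); request a DOM (xml.dom.minidom) tree instead.
parsed = parse(html_doc, treebuilder="dom")

# Find the first <title> element and print its text content.
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)
```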

```php
use Embed\Embed;

$embed = new Embed();

//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');

//Get content info
$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords

$info->image; //The thumbnail or main image

$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The aspect ratio (width/height)

$info->authorName; //The resource author
$info->authorUrl; //The author url

$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages

$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)

$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds
```