embed vs cssselect
Embed is a PHP library that gets information from any web page (using oEmbed, Open Graph, Twitter Cards, scraping the HTML, etc.). It's compatible with any web service (YouTube, Vimeo, Flickr, Instagram, etc.) and has adapters for some sites (archive.org, GitHub, Facebook, etc.).
cssselect is a BSD-licensed Python library to parse CSS3 selectors and translate them to XPath 1.0 expressions.
XPath 1.0 expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.
cssselect is used by other popular Python packages such as parsel and scrapy, but it can also be used on its own to generate valid XPath 1.0 expressions for parsing HTML and XML documents in other tools.
Note that because XPath selectors are more powerful than CSS selectors, this translation is only possible in one direction. Converting XPath to CSS selectors is impractical and not supported by cssselect.
Example Use
use Embed\Embed;
$embed = new Embed();
//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');
//Get content info
$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords
$info->image; //The thumbnail or main image
$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The aspect ratio (width/height)
$info->authorName; //The resource author
$info->authorUrl; //The author url
$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages
$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)
$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds
from cssselect import GenericTranslator, SelectorError

translator = GenericTranslator()
try:
    # Translate the CSS selector into an equivalent XPath 1.0 expression
    expression = translator.css_to_xpath('div.content')
    print(expression)
    # Prints: descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]
except SelectorError as e:
    # Raised when the selector is invalid or cannot be translated
    print(f'Invalid selector: {e}')
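The generated expression above pads the class attribute with spaces so that `content` only matches as a whole class token, not as a substring of another class name. The following is a minimal plain-Python sketch of that predicate, meant only to illustrate the logic. The helper names `normalize_space` and `has_class` are hypothetical and are not part of cssselect or any XPath engine.

```python
import re

# XPath 1.0 treats only these four characters as whitespace
_XPATH_WS = " \t\r\n"


def normalize_space(value: str) -> str:
    """Emulate XPath normalize-space(): trim the ends, collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(_XPATH_WS))


def has_class(class_attr, name: str) -> bool:
    """Mirror: @class and contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    if class_attr is None:
        # No class attribute at all, so the "@class" test fails
        return False
    # concat(' ', ..., ' ') guarantees every token is surrounded by spaces
    padded = " " + normalize_space(class_attr) + " "
    return f" {name} " in padded


print(has_class("main content", "content"))        # True
print(has_class("content-wide", "content"))        # False
print(has_class("\tcontent\n sidebar", "content"))  # True
print(has_class(None, "content"))                  # False
```

A plain `contains(@class, 'content')` would also match `content-wide`, which is why the translated expression is longer than one might expect.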