embedvsxmltodict

MIT 71 6 2,103

5.2 thousand (month) Oct 26 2013 v4.4.15(7 months ago)

5,577 2 88 MIT

Jul 30 2007 62.5 million (month) 0.14.2(10 months ago)

PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web service (youtube, vimeo, flickr, instagram, etc) and has adapters to some sites like (archive.org, github, facebook, etc).

xmltodict is a Python library that allows you to work with XML data as if it were JSON. It allows you to parse XML documents and convert them to dictionaries, which can then be easily manipulated using standard dictionary operations.

You can also use the library to convert a dictionary back into an XML document. xmltodict is built on top of the popular lxml library and provides a simple, intuitive API for working with XML data.

Note that despite using lxml conversion speeds can be quite slow for large XML documents and in web scraping this should be used to parse specific snippets instead of whole HTML documents.

xmltodict pairs well with JSON parsing tools like jmespath or jsonpath. Alternatively, it can be used in reverse mode to parse JSON documents using HTML parsing tools like CSS selectors and XPath.

It can be installed via pip by running pip install xmltodict command.

Example Use

use Embed\Embed;

$embed = new Embed();

//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');

//Get content info

$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords

$info->image; //The thumbnail or main image

$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The aspect ratio (width/height)

$info->authorName; //The resource author
$info->authorUrl; //The author url

$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages

$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)

$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds

import xmltodict

xml_string = """
<book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publisher>Charles Scribner's Sons</publisher>
    <publication_date>1925</publication_date>
</book>
"""

book_dict = xmltodict.parse(xml_string)
print(book_dict)
{'book': {'title': 'The Great Gatsby',
'author': 'F. Scott Fitzgerald',
'publisher': "Charles Scribner's Sons",
'publication_date': '1925'}}

# and to reverse:
book_xml = xmltodict.unparse(book_dict)
print(book_xml)

# the xml can be loaded and parsed using parsel or beautifulsoup:
from parsel import Selector
sel = Selector(book_xml)
print(sel.css('publication_date::text').get())
'1925'