feedparser vs xmltodict

NOASSERTION 92 9 2,048

3.7 million (month) Jun 15 2007 6.0.11(1 year, 8 months ago)

5,577 2 88 MIT

Jul 30 2007 62.5 million (month) 0.14.2(10 months ago)

feedparser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple’s iTunes extensions.

To use feedparser (also known as Universal Feed Parser), you will need Python 3.6 or later. It is not meant to run standalone; it is a module for you to use as part of a larger Python program.

feedparser is useful for scraping data feeds because it can both download a feed and parse its XML into structured data.

xmltodict is a Python library that allows you to work with XML data as if it were JSON. It allows you to parse XML documents and convert them to dictionaries, which can then be easily manipulated using standard dictionary operations.
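To make the XML-to-dictionary mapping more concrete, here is a small sketch showing how xmltodict represents attributes and repeated elements. The library and book elements are invented for the example; the "@" attribute prefix, the "#text" key and the force_list argument are xmltodict's default conventions.

```python
import xmltodict

# Hypothetical document with attributes and repeated elements
xml_string = """
<library>
    <book id="1">
        <title lang="en">The Great Gatsby</title>
    </book>
    <book id="2">
        <title lang="en">Tender Is the Night</title>
    </book>
</library>
"""

doc = xmltodict.parse(xml_string)

# repeated <book> elements are collected into a list
books = doc["library"]["book"]

# attributes get an "@" prefix; text that sits next to attributes is stored under "#text"
ids = [book["@id"] for book in books]
titles = [book["title"]["#text"] for book in books]
print(ids)
print(titles)

# a single <book> would normally be a dict, not a list;
# force_list makes the result shape predictable
single = xmltodict.parse("<library><book id='1'/></library>", force_list=("book",))
print(single["library"]["book"])
```

Forcing list output for elements that may repeat avoids having to check whether a key holds a dict or a list before iterating.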

You can also use the library to convert a dictionary back into an XML document. xmltodict is built on top of the Expat parser from Python's standard library (xml.parsers.expat), not lxml, and provides a simple, intuitive API for working with XML data.

Note that conversion can be quite slow for large XML documents, so in web scraping xmltodict is better used to parse specific snippets rather than whole HTML documents.

xmltodict pairs well with JSON parsing tools like jmespath or jsonpath. It can also be used in reverse: convert a JSON document to XML, then query it with HTML parsing tools such as CSS selectors and XPath.

xmltodict can be installed via pip by running the pip install xmltodict command.

Example Use

import feedparser

# the feed can be loaded from a remote URL
data = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
# local path
data = feedparser.parse('/home/user/data.xml')
# or raw string
data = feedparser.parse('<xml>...</xml>')

# the result dataset is a nested python dictionary containing feed data:
data['feed']['title']

import xmltodict

xml_string = """
<book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publisher>Charles Scribner's Sons</publisher>
    <publication_date>1925</publication_date>
</book>
"""

book_dict = xmltodict.parse(xml_string)
print(book_dict)
# {'book': {'title': 'The Great Gatsby',
#           'author': 'F. Scott Fitzgerald',
#           'publisher': "Charles Scribner's Sons",
#           'publication_date': '1925'}}

# and to reverse:
book_xml = xmltodict.unparse(book_dict)
print(book_xml)

# the xml can be loaded and parsed using parsel or beautifulsoup:
from parsel import Selector
sel = Selector(text=book_xml)
print(sel.css('publication_date::text').get())
# 1925
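The reverse direction, starting from JSON rather than XML, might look like the sketch below. To keep it dependency-light it swaps parsel for the standard library's xml.etree.ElementTree, which only supports a limited XPath subset. The JSON content and the catalog root element are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

import xmltodict

# Hypothetical JSON document used only for illustration
json_string = """
{"book": [
    {"title": "The Great Gatsby", "year": 1925},
    {"title": "Tender Is the Night", "year": 1934}
]}
"""
data = json.loads(json_string)

# unparse() needs exactly one root element, so wrap the data in an assumed <catalog> root
xml_string = xmltodict.unparse({"catalog": data})

# ElementTree's findall/findtext accept a small subset of XPath
root = ET.fromstring(xml_string.encode("utf-8"))
titles = [element.text for element in root.findall("./book/title")]
year = root.findtext("./book[title='The Great Gatsby']/year")

print(titles)
print(year)
```

The same xml_string could be handed to parsel's Selector instead when full XPath or CSS selector support is needed.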

Alternatives / Similar

sax-js: 1,101
parse5: 3,698
htmlparser2: 4,529
beautifulsoup: -
lxml: 2,737
xmltodict: 5,577
cheerio: 28,873
html5lib: 1,153
cssselect: 293
nokogiri: 6,173
pyquery: 2,312
parsel: 1,187
requests-html: 13,780
xml2: 221
rvest: 1,498
selectolax: 1,186
html5-php: 1,638
untangle: 619
domcrawler: 3,985
goquery: 14,273
xpath: 699
cascadia: 717
soup: 2,191
htmlquery: 760
html5-parser: 683
chompjs: 202
gazpacho: 764
embed: 2,103
chopper: 22
ralger: 156