feedparservshtml5lib
feedparser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Appleās iTunes extensions.
To use Universal Feed Parser, you will need Python 3.6 or later. Universal Feed Parser is not meant to run standalone; it is a module for you to use as part of a larger Python program.
feedparser can be used to scrape data feeds as it can download them and parse the XML structured data.
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml
(like parsel
or beautifulsoup
).
However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.
Example Use
import feedparser
# the feed can be loaded from a remote URL
data = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
# local path
data = feedparser.parse('/home/user/data.xml')
# or raw string
data = feedparser.parse('<xml>...</xml>')
# the result dataset is a nested python dictionary containing feed data:
data['feed']['title']
import html5lib
from html5lib import parse
html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)