xml2 vs html5lib

xml2: MIT | 62 | 3 | 213 | 675.4 thousand (month) | Apr 20 2015 | 1.3.6 (7 months ago)
html5lib: 1,081 | 14 | 84 | MIT License | 1.1 (3 years ago) | Jul 30 2007 | 13.2 million (month)

The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.

xml2 can be used to parse HTML documents using XPath selectors and is a successor to R's XML package with a few improvements:

  • xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
  • xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.
  • More convenient handling of namespaces in XPath expressions - see xml_ns() and xml_ns_strip() to get started.

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Because html5lib is implemented in pure Python, it is significantly slower than alternatives powered by lxml (like parsel or beautifulsoup). However, html5lib follows the HTML5 parsing algorithm more faithfully, so it can represent the HTML tree more correctly than those alternatives.
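As a rough illustration of the browser-like parsing described above, the sketch below feeds html5lib a deliberately malformed fragment. The input string and the variable names are made up for this example, and `namespaceHTMLElements=False` is passed only to keep the tag names short. It assumes html5lib is installed.

```python
import html5lib

# Malformed input: no <html>, <head> or <body>, and <p> and <b> are never closed.
broken = "<p>Hi <b>!"

# namespaceHTMLElements=False keeps tag names plain ("p" instead of
# "{http://www.w3.org/1999/xhtml}p") in the default ElementTree output.
doc = html5lib.parse(broken, namespaceHTMLElements=False)

# The parser supplies the missing structure, much as a browser would.
top_level = [child.tag for child in doc]
bold = doc.find("./body/p/b")
bold_text = bold.text if bold is not None else None

print(doc.tag)    # html
print(top_level)  # ['head', 'body']
print(bold_text)  # !
```

The missing html, head and body elements are added and the unclosed tags are closed implicitly, which is the behaviour the HTML specification defines for this kind of input.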

Example Use


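# xml2 (R)
library("xml2")

# Parse an XML string into a document
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x

xml_name(x)               # name of the root node
xml_children(x)           # child element nodes
xml_text(x)               # text content of the node and its descendants
xml_find_all(x, ".//baz") # all nodes matching an XPath expression

# read_html() also accepts malformed HTML such as unclosed tags
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)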
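
# html5lib (Python)
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
# parse() returns an ElementTree element by default; request a minidom
# document so that getElementsByTagName() is available
parsed = parse(html_doc, treebuilder="dom")
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)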
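
The kind of object that parse() returns depends on its treebuilder argument, which determines the methods available on the result. The following is a minimal sketch of the same title lookup against both the default ElementTree tree and a minidom tree. The variable names are illustrative only, and html5lib is assumed to be installed.

```python
import html5lib

html_doc = "<html><head><title>My Title</title></head><body></body></html>"

# Default treebuilder ("etree"): the result is an xml.etree.ElementTree element,
# queried with find() and .text.
etree_root = html5lib.parse(html_doc, namespaceHTMLElements=False)
etree_title = etree_root.find("./head/title").text

# treebuilder="dom": the result is an xml.dom.minidom document,
# queried with getElementsByTagName() and childNodes.
dom_doc = html5lib.parse(html_doc, treebuilder="dom")
dom_title = dom_doc.getElementsByTagName("title")[0].childNodes[0].nodeValue

print(etree_title)
print(dom_title)
```

Both lookups return the same string, so the choice comes down to which tree API the rest of the code expects.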

Alternatives / Similar