Skip to content

xml2vschompjs

MIT 63 3 216
624.5 thousand (month) Apr 20 2015 1.3.6(9 months ago)
175 1 4 MIT
1.2.3(2 months ago) Jul 30 2007 15.5 thousand (month)

The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.

xml2 can be used to parse HTML documents using XPath selectors and is a successor to R's XML package with a few improvements:

  • xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
  • xml2 has a very simple class hierarchy so don't need to think about exactly what type of object you have, xml2 will just do the right thing.
  • More convenient handling of namespaces in Xpath expressions - see xml_ns() and xml_ns_strip() to get started.

chompjs can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.

In web scraping this is particularly useful for parsing Javascript variables like:

import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

In practice this can be used to extract hidden JSON data like data from <script id=__NEXT_DATA__> elements from nextjs (and similar) websites. Unlike json.loads command chompjs can ingest json documents that contain javascript natives like functions making it a super easy way to scrape hidden web data objects.

Example Use


library("xml2")
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x

xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")

h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
# basic use
import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

# example how to use with hidden data parsing:
import httpx
import chompjs
from parsel import Selector

response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
data = chompjs.parse_js_object(hidden_script)
print(data['props'])

Alternatives / Similar