chompjs vs cssselect
chompjs can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.
This is particularly useful for parsing JavaScript variables like:
import chompjs

js = """
var myObj = {
    myMethod: function(params) {
        // ...
    },
    myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
# {'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}
In practice this can be used to extract hidden JSON data, such as the contents of <script id=__NEXT_DATA__> elements on Next.js (and similar) websites. Unlike json.loads, chompjs can ingest JSON-like documents that contain JavaScript natives such as functions, which makes it an easy way to scrape hidden web data objects.
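To make the contrast with json.loads concrete, here is a minimal sketch. The helper names parse_with_json and parse_with_chompjs and the sample string are hypothetical, chosen only for illustration, and the second helper assumes chompjs is installed (pip install chompjs).

```python
import json

# Hypothetical sample: a JavaScript object literal, not valid JSON
# (unquoted key, single-quoted strings).
JS_SNIPPET = "{myValue: 100, 'name': 'demo'}"


def parse_with_json(text):
    """Return the parsed value, or None if the text is not strict JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_with_chompjs(text):
    """Parse a JavaScript object literal into Python data using chompjs."""
    import chompjs  # third-party; imported lazily so the JSON helper works without it

    return chompjs.parse_js_object(text)


# Strict JSON parsing rejects the JavaScript literal.
print(parse_with_json(JS_SNIPPET))  # None
```

Calling parse_with_chompjs(JS_SNIPPET) on the same string should yield an ordinary dictionary with the keys myValue and name, which is the behaviour the paragraph above describes.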
cssselect is a BSD-licensed Python library to parse CSS3 selectors and translate them to XPath 1.0 expressions.
XPath 1.0 expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.
cssselect is used by other popular Python packages like parsel and scrapy, but it can also be used on its own to generate valid XPath 1.0 expressions for parsing HTML and XML documents in other tools.
Note that because XPath selectors are more powerful than CSS selectors, this translation is only possible one way. Converting XPath to CSS selectors is impractical and not supported by cssselect.
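As a rough illustration of the CSS-to-XPath translation and of handing the result to lxml, the sketch below wraps both steps in small helpers. The names css_to_xpath and select_text and the sample HTML are hypothetical, and the sketch assumes both cssselect and lxml are installed.

```python
# Hypothetical sample document: only the inner <div> should match "div.content".
SAMPLE_HTML = '<div><div class="content main">Hello</div><p class="content">skip</p></div>'


def css_to_xpath(css):
    """Translate a CSS selector to an XPath 1.0 string, or None if it is invalid."""
    from cssselect import GenericTranslator, SelectorError  # third-party

    try:
        return GenericTranslator().css_to_xpath(css)
    except SelectorError:
        return None


def select_text(html, css):
    """Return the text of every element in `html` matched by the CSS selector."""
    import lxml.html  # third-party

    expression = css_to_xpath(css)
    if expression is None:
        return []
    document = lxml.html.fromstring(html)
    # lxml evaluates the generated XPath expression directly.
    return [element.text for element in document.xpath(expression)]
```

With these helpers, select_text(SAMPLE_HTML, "div.content") should return only the text of the inner div, because the p element shares the class but not the tag name. A malformed selector such as "div..content" makes css_to_xpath return None instead of raising.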
Example Use
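# Example: parsing hidden data embedded in a page
import httpx
import chompjs
from parsel import Selector

# example.com is a placeholder; substitute a page that embeds a __NEXT_DATA__ script
response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
if hidden_script is not None:  # .get() returns None when the selector matches nothing
    data = chompjs.parse_js_object(hidden_script)
    print(data['props'])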
# cssselect: translate a CSS selector into an XPath 1.0 expression
from cssselect import GenericTranslator, SelectorError

translator = GenericTranslator()
try:
    expression = translator.css_to_xpath('div.content')
    print(expression)
    # descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]
except SelectorError as e:
    print(f'Invalid selector: {e}')