chompjsvsparsel
chompjs can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.
In web scraping this is particularly useful for parsing Javascript variables like:
import chompjs
js = """
var myObj = {
myMethod: function(params) {
// ...
},
myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
In practice this can be used to extract hidden JSON data like data from <script id=__NEXT_DATA__>
elements
from nextjs (and similar) websites. Unlike json.loads
command chompjs can ingest json documents that contain
javascript natives like functions making it a super easy way to scrape hidden web data objects.
parsel
is a library for parsing HTML and XML using selectors, similar to beautifulsoup
. It is built on top of the lxml
library and allows for easy extraction of data from HTML and XML files using selectors, similar to how you would use CSS selectors in web development. It is a light-weight library which is specifically designed for web scraping and parsing, so it is more efficient and faster than beautifulsoup
in some use cases.
Some of the key features of parsel
include:
- CSS selector & XPath selector support:
Two most common html parsing path languages are both supported in parsel. This allows selecting attributes, tags, text and complex matching rules that use regular expressions or XPath functions. - Modifying data:
parsel
allows you to modify the contents of an element, remove elements or add new elements to a document. - Support for both HTML and XML:
parsel
supports both HTML and XML documents and you can use the same selectors for both formats.
It is easy to use and less verbose than beautifulsoup, so it's quite popular among the developers who are working with Web scraping projects and parse data from large volume of web pages.
Highlights
Example Use
# basic use
import chompjs
js = """
var myObj = {
myMethod: function(params) {
// ...
},
myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
# example how to use with hidden data parsing:
import httpx
import chompjs
from parsel import Selector
response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
data = chompjs.parse_js_object(hidden_script)
print(data['props'])
from parsel import Selector
# this is our HTML page:
html = """
<head>
<title>Hello World!</title>
</head>
<body>
<div id="product">
<h1>Product Title</h1>
<p>paragraph 1</p>
<p>paragraph2</p>
<span class="price">$10</span>
</div>
</body>
"""
selector = Selector(html)
# we can use CSS selectors:
selector.css("#product .price::text").get()
"$10"
# or XPath:
selector.xpath('//span[@class="price"]').get()
"$10"
# or get all matching elements:
print(selector.css("#product p::text").getall())
["paragraph 1", "paragraph2"]
# parsel also comes with utility methods like regular expression parsing:
selector.xpath('//span[@class="price"]').re("\d+")
["10"]