DOMCrawler vs Chompjs
The DOMCrawler library is part of the Symfony Components project and provides an easy way to traverse and manipulate HTML and XML documents using the Document Object Model (DOM) in PHP.
DOMCrawler supports both CSS selectors and XPath for HTML document parsing and is one of the most popular HTML parsing tools used in web scraping with PHP.
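For example, here is a minimal sketch of both selection styles (this assumes the symfony/css-selector component is installed alongside DOMCrawler, which it needs for CSS selector support):
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler('<html><body><a href="/docs" class="nav">Docs</a></body></html>');
// CSS selector: grab the href attribute of the matched node
echo $crawler->filter('a.nav')->attr('href');  // "/docs"
// XPath: grab the text content of the same node
echo $crawler->filterXPath('//a[@class="nav"]')->text();  // "Docs"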
chompjs can be used in web scraping for turning JavaScript objects embedded in pages into valid Python dictionaries.
In web scraping this is particularly useful for parsing JavaScript variables like:
import chompjs
js = """
var myObj = {
myMethod: function(params) {
// ...
},
myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
In practice this can be used to extract hidden JSON data, such as the contents of <script id="__NEXT_DATA__"> elements on Next.js (and similar) websites. Unlike json.loads, chompjs can ingest JSON documents that contain JavaScript natives like functions, which makes it an easy way to scrape hidden web data objects.
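For example, here is a small illustrative sketch of that difference (the object literal below is made up for demonstration):
import json
import chompjs
js = "{size: 100, onClick: function() { return 1; }}"
# json.loads() rejects this input: the keys are unquoted and one value is a function
try:
    json.loads(js)
except json.JSONDecodeError as error:
    print("json.loads failed:", error)
# chompjs tolerates the JavaScript syntax and returns a Python dict,
# keeping the function body as a plain string
print(chompjs.parse_js_object(js))
# {'size': 100, 'onClick': 'function() { return 1; }'}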
Example Use
use Symfony\Component\DomCrawler\Crawler;
$html = '<html><body><h1 class="title">Hello World</h1></body></html>';
$crawler = new Crawler($html);
// Find all elements using CSS selectors
$elements = $crawler->filter('.title');
// or XPath
$elements = $crawler->filterXPath('//h1');
// Print the text content of the elements
foreach ($elements as $element) {
    echo $element->textContent;
}
# basic use
import chompjs
js = """
var myObj = {
myMethod: function(params) {
// ...
},
myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
# example: parsing hidden web data
import httpx
import chompjs
from parsel import Selector
# retrieve the page and select the hidden JSON embedded in the __NEXT_DATA__ script
response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
# chompjs turns the JavaScript object into a Python dictionary
data = chompjs.parse_js_object(hidden_script)
print(data['props'])
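As a follow-up note, Next.js pages usually nest the page content under props -> pageProps in the parsed payload, and when a script embeds more than one JavaScript object, chompjs also offers parse_js_objects(), which yields every object it finds. A small sketch continuing the example above:
# Next.js convention: the scraped dataset usually sits under props -> pageProps
page_data = data["props"]["pageProps"]
# parse_js_objects() yields each object when a script embeds several of them
for obj in chompjs.parse_js_objects(hidden_script):
    print(obj)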