parse5 vs chompjs
parse5 is a Node.js library for parsing and serializing HTML documents. It is designed to be fast and flexible, and it is commonly used in web scraping and web development projects.
parse5 is used by popular libraries such as Angular, Lit, Cheerio and many more. Unlike Cheerio, parse5 is a low-level HTML parsing library, which can make it useful directly in web scraping when no higher-level abstraction is needed.
chompjs is a Python library used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.
This is particularly useful for parsing JavaScript variables like:
```python
import chompjs

js = """
var myObj = {
    myMethod: function(params) {
        // ...
    },
    myValue: 100
}
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
```
In practice, this can be used to extract hidden JSON data, such as the contents of `<script id=__NEXT_DATA__>` elements
on Next.js (and similar) websites. Unlike `json.loads`, chompjs can ingest JSON documents that contain
JavaScript natives such as functions, which makes it an easy way to scrape hidden web data objects.
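
As a rough illustration of the hidden-data case, the sketch below pulls the body of a `__NEXT_DATA__` script tag out of a page and parses it. The HTML snippet, the regular expression, and the `props.pageProps.product` layout are all assumptions made up for this example, not something every Next.js site follows. The fallback to `json.loads` is also an assumption that only holds when the payload is strict JSON.

```python
import json
import re

try:
    import chompjs  # third-party: pip install chompjs
    parse = chompjs.parse_js_object
except ImportError:
    # Assumption: the payload is strict JSON, so the stdlib parser can stand in.
    parse = json.loads

# Hypothetical page source; a real page would come from an HTTP response body.
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"product": {"name": "Box", "price": 9.99}}}}
</script>
</body></html>
"""

# Capture the text between the opening and closing script tags.
match = re.search(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    html,
    flags=re.DOTALL,
)
next_data = parse(match.group(1).strip())

# Hypothetical data layout: walk down to the part of the payload we care about.
product = next_data["props"]["pageProps"]["product"]
print(product["name"], product["price"])
```

For pages where the embedded object is a JavaScript literal rather than strict JSON (unquoted keys, functions), the `json.loads` fallback would fail and chompjs would be required.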
Example Use
```javascript
// ... (the opening lines, ending in a markup string containing "New Element", are missing)
body.appendChild(newElement.childNodes[0]);
console.log(parse5.serialize(document));
```