Skip to content

html5libvschompjs

MIT License 83 14 1,092
18.9 million (month) Jul 30 2007 1.1(3 years ago)
175 1 4 MIT
1.2.3(2 months ago) Jul 30 2007 15.5 thousand (month)

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml (like parsel or beautifulsoup). However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.

chompjs can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.

In web scraping this is particularly useful for parsing Javascript variables like:

import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

In practice this can be used to extract hidden JSON data like data from <script id=__NEXT_DATA__> elements from nextjs (and similar) websites. Unlike json.loads command chompjs can ingest json documents that contain javascript natives like functions making it a super easy way to scrape hidden web data objects.

Example Use


import html5lib
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)
# basic use
import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

# example how to use with hidden data parsing:
import httpx
import chompjs
from parsel import Selector

response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
data = chompjs.parse_js_object(hidden_script)
print(data['props'])

Alternatives / Similar