chompjs vs html5-php

chompjs: MIT · 4 · 1 · 175 · 15.5 thousand (month) · Jul 30 2007 · 1.2.3 (2 months ago)
html5-php: 1,460 · 5 · 32 · MIT · 2.9.0 (6 days ago) · Jun 01 2013 · 158.2 thousand (month)

chompjs can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.

In web scraping, this is particularly useful for parsing JavaScript variables like:

import chompjs

js = """
  var myObj = {
    myMethod: function(params) {
        // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
# Output:
# {'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}
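
The json_params={'strict': False} argument above is presumably forwarded to Python's json decoder (an assumption based on the parameter name, not something this page confirms). As a rough, stdlib-only sketch of what that flag changes: strict mode rejects raw control characters such as newlines inside strings, which is what a multi-line function body turned into a string contains. The sample text below is made up for illustration.

```python
import json

# A JSON document whose string value contains real newline characters,
# similar to a function body that has been wrapped in quotes.
raw = '{"myMethod": "function(params) {\n    // ...\n}"}'

try:
    json.loads(raw)  # strict=True is the default
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With strict=False, control characters inside strings are accepted.
lenient = json.loads(raw, strict=False)

print(strict_ok)
print(lenient["myMethod"])
```

If the newlines were escaped as \\n in the JSON text instead, the default strict mode would accept the document as well.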

In practice, this can be used to extract hidden JSON data, such as the contents of <script id=__NEXT_DATA__> elements on Next.js (and similar) websites. Unlike json.loads, chompjs can ingest documents that contain JavaScript natives such as functions, which makes it an easy way to scrape hidden web data objects.
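
One possible way to combine the two parsers is sketched below: pull the script text out with the standard library, try json.loads first, and fall back to chompjs only when strict JSON parsing fails. The names ScriptExtractor and extract_hidden_data are hypothetical helpers invented for this sketch, the sample HTML is made up, and chompjs is treated as an optional dependency.

```python
import json
from html.parser import HTMLParser

try:
    import chompjs  # optional: only needed for non-JSON JavaScript objects
except ImportError:
    chompjs = None


class ScriptExtractor(HTMLParser):
    """Collect the text inside <script> elements that carry a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_hidden_data(html, script_id="__NEXT_DATA__"):
    """Return the parsed script payload, or None if the script is missing."""
    parser = ScriptExtractor(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if chompjs is None:
            raise
        # Fall back to the more tolerant parser for JavaScript object literals
        return chompjs.parse_js_object(text)


sample_html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"title": "Hello"}}}
</script>
</body></html>
"""
data = extract_hidden_data(sample_html)
print(data["props"]["pageProps"]["title"])
```

Trying json.loads first keeps the common case dependency-free, since __NEXT_DATA__ payloads are usually plain JSON. The fallback matters for pages that embed JavaScript object literals instead.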

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features:

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer

Note that html5-php is a low-level HTML parser and does not provide query features such as CSS selectors.
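
To illustrate what "event-based (SAX-like) parser" and "DOM tree builder" mean in the feature list, here is a conceptual sketch in Python using the standard library's html.parser. It is an analogy only and does not show html5-php's API. Node, TreeBuilder and find_all are hypothetical names made up for this sketch. The find_all helper shows the kind of manual tree walking that is needed when no selector engine is available.

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: tag name, attributes, text and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Turn parser events into a small tree (void elements are not handled)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Event: an element opened, so attach a node and descend into it
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Event: an element closed, so climb to the nearest matching open node
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Event: text content, stored on the element currently open
        self.current.text += data


def find_all(node, tag):
    """Depth-first search by tag name, standing in for a selector engine."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


builder = TreeBuilder()
builder.feed(
    "<html><body id='foo'><h1>Hello World</h1>"
    "<p>This is a test.</p></body></html>"
)
builder.close()

headings = find_all(builder.root, "h1")
print(headings[0].text)
```

The event handlers correspond to the SAX-like layer, and the tree they assemble plays the role of the DOM. A query layer (QueryPath in html5-php's case) would sit on top of that tree.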

Example Use


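# example of hidden data parsing:
import httpx
import chompjs
from parsel import Selector

response = httpx.get("http://example.com")
# select the text of the <script id="__NEXT_DATA__"> element
hidden_script = Selector(text=response.text).css("script#__NEXT_DATA__::text").get()
if hidden_script is None:
    # .get() returns None when the page has no such script element
    raise ValueError("no __NEXT_DATA__ script found on the page")
data = chompjs.parse_js_object(hidden_script)
print(data['props'])

<?php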
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

// An example HTML document:
$html = <<< 'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

// Or save it to a file:
$html5->save($dom, 'out.html');

Alternatives / Similar