Skip to content

requests-htmlvschompjs

MIT 214 2 13,558
1.1 million (month) Feb 25 2018 0.10.0(5 years ago)
175 1 4 MIT
1.2.3(2 months ago) Jul 30 2007 15.5 thousand (month)

requests-html is a Python package that allows you to easily make HTTP requests and parse the HTML content of web pages. It is built on top of the popular requests package and uses the html parser from the lxml library, which makes it fast and efficient. This package is designed to provide a simple and convenient API for web scraping, and it supports features such as JavaScript rendering, CSS selectors, and form submissions.

It also offers a lot of functionalities such as cookie, session, and proxy support, which makes it an easy-to-use package for web scraping and web automation tasks.

In short requests-html offers:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support

chompjs can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.

In web scraping this is particularly useful for parsing Javascript variables like:

import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

In practice this can be used to extract hidden JSON data like data from <script id=__NEXT_DATA__> elements from nextjs (and similar) websites. Unlike json.loads command chompjs can ingest json documents that contain javascript natives like functions making it a super easy way to scrape hidden web data objects.

Example Use


from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.example.com')

# print the HTML content of the page
print(r.html.html)

# use CSS selectors to find specific elements on the page
title = r.html.find('title', first=True)
print(title.text)
# basic use
import chompjs
js = """
  var myObj = {
    myMethod: function(params) {
    // ...
    },
    myValue: 100
  }
"""
chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

# example how to use with hidden data parsing:
import httpx
import chompjs
from parsel import Selector

response = httpx.get("http://example.com")
hidden_script = Selector(response.text).css("script#__NEXT_DATA__::text").get()
data = chompjs.parse_js_object(hidden_script)
print(data['props'])

Alternatives / Similar