JSON

JSON is another popular data format encountered in web scraping. It's a text-based format that stores data as key-value pairs and arrays, and it closely resembles Python dictionaries and JavaScript objects:

{
  "title": "Product",
  "published": "2020-01-01",
  "price": 15.99,
  "tags": ["new", "sale"],
  "price_by_color": {
    "red": 15.99,
    "yellow": 12.99
  }
}
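
To make the comparison with Python dictionaries concrete, here is a minimal sketch that parses a product document like the one above with Python's standard json module. The variable names (raw, product) are illustrative only.

```python
import json

# A JSON document as it arrives over the wire: plain text.
raw = """
{
  "title": "Product",
  "published": "2020-01-01",
  "price": 15.99,
  "tags": ["new", "sale"],
  "price_by_color": {"red": 15.99, "yellow": 12.99}
}
"""

# json.loads turns JSON text into native Python objects:
# objects become dicts, arrays become lists, numbers become int/float.
product = json.loads(raw)

print(type(product).__name__)                # dict
print(product["tags"][0])                    # new
print(product["price_by_color"]["yellow"])   # 12.99
```

Note that values such as "published" stay plain strings, since JSON has no date type.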

It's a very popular format for data exchange between web applications and APIs, so JSON is commonly encountered when scraping background requests and JavaScript variables embedded in pages.
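
As a rough sketch of the JavaScript variable case, the snippet below pulls a JSON object out of a script tag using only the standard library. The page source, the variable name __STATE__ and its contents are all made up for illustration, and the approach assumes the assigned value is valid JSON rather than an arbitrary JavaScript object literal.

```python
import json
import re

# Hypothetical page source: the variable name __STATE__ and its contents
# are invented for this example.
html = """
<html><body>
<script>
  window.__STATE__ = {"product": {"title": "Product", "price": 15.99}};
</script>
</body></html>
"""

# Capture the object assigned to the variable. This simple pattern assumes
# the assignment ends with "};" and that "};" never occurs inside the data.
match = re.search(r"window\.__STATE__\s*=\s*(\{.*?\});", html, re.DOTALL)
state = json.loads(match.group(1)) if match else None

print(state["product"]["price"])  # 15.99
```

For real pages a regular expression like this can be fragile, so it is worth checking the captured text before passing it to json.loads.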

Most programming languages used in web scraping can parse JSON natively or through their standard library, so it's an easy format to scrape regardless of the tooling.

Parsing JSON

Scraped JSON datasets can be huge, made up of several nested layers of objects and arrays. It's not uncommon to see JSON datasets with 1000+ keys, so JSON parsing can quickly become very complex.

There are a few popular ways to deal with large JSON datasets. To start, recursive tools like object-scan (JavaScript) and nested-lookup (Python) can be used to easily select nested keys:

nested-lookup provides a simple way to look up nested keys in Python:

from nested_lookup import nested_lookup

data = {
    "a": {
        "b": {
            "email": "foo@example.com"
        }
    }
}
# nested_lookup returns a list of every value stored under the key
print(nested_lookup('email', data)[0])
# prints: foo@example.com

object-scan provides a similar way to look up nested keys in JavaScript:

const objectScan = require('object-scan');

const myNestedObject = {
  level1: {
    level2: {
      level3: {
        myTargetKey: 'value',
      },
    },
  },
};

const searchTerm = 'myTargetKey';
// rtn: 'value' returns the matched values instead of the matched key paths
const result = objectScan([`**.${searchTerm}`], { rtn: 'value' })(myNestedObject);
console.log(result);
// prints: [ 'value' ]
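
To illustrate the idea behind these recursive lookup tools, here is a minimal standard-library sketch. It is not how nested-lookup or object-scan are implemented; find_key is a hypothetical helper written for this example, and the second email entry is made-up data.

```python
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # keep descending: the value itself may contain more matches
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


data = {
    "a": {"b": {"email": "foo@example.com"}},
    "contacts": [{"email": "bar@example.com"}],
}
emails = list(find_key(data, "email"))
print(emails)  # ['foo@example.com', 'bar@example.com']
```

A generator keeps memory use low on large documents, since matches are produced one at a time instead of being collected up front.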

For more complex JSON parsing, there are specialized query languages, much like CSS selectors and XPath are used for HTML:

  • jsonpath - query language similar to XPath.
  • jmespath - query language similar to CSS selectors.

Both jsonpath and jmespath are available in many languages, though they differ from each other in syntax and features.

Jsonpath

Jsonpath is closer to XPath notation and allows powerful navigation of nested keys. In particular, the recursive descent selector .. (equivalent to XPath's //) is great for working with unreliable JSON structures, since it matches a key at any depth.

from jsonpath_ng import parse

data = {
    "data": {
        "info": {
            "products": [
                {"price": 1, "_type": "product", "id": "123"}, 
                {"price": 2, "_type": "product", "id": "345"}
            ]
        }
    }
}
# recursively select the "products" key, then the "price" of each product
jsonpath_expr = parse('$..products[*].price')
for match in jsonpath_expr.find(data):
    print(match.value)
# prints 1, then 2

Jmespath

Jmespath, on the other hand, doesn't support recursive selection, but it has powerful data reshaping capabilities:

import jmespath

data = {
  "data": {
      "info": {
          "products": [
              {"price": {"usd": 1}, "_type": "product", "id": "123"}, 
              {"price": {"usd": 2}, "_type": "product", "id": "345"}
          ]
      }
  }
}

# easily reshape nested dataset to flat structure:
result = jmespath.search("data.info.products[*].{id:id, price:price.usd}", data)
print(result)
# prints: [{'id': '123', 'price': 1}, {'id': '345', 'price': 2}]

Both of these tools are very popular in modern web scraping when it comes to JSON data parsing.