JSON
JSON is another popular data format encountered in web scraping. It's a text-based format that stores data as key-value pairs, much like Python dictionaries and JavaScript objects:
{
  "title": "Product",
  "published": "2020-01-01",
  "price": 15.99,
  "tags": ["new", "sale"],
  "price_by_color": {
    "red": 15.99,
    "yellow": 12.99
  }
}
It's a very popular format for data exchange between web applications and APIs, so JSON is most often encountered when scraping background requests and JavaScript variables embedded in pages.
JSON is natively supported by most programming languages used in web scraping, so it's an easy format to work with regardless of the tooling.
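As a minimal sketch of both cases, Python's standard json module can parse a background response body directly, and the same call works on JSON assigned to a JavaScript variable once that text is cut out of the page. The HTML snippet, the window.__DATA__ variable name and the regular expression below are made up for illustration and are not taken from any real site:

```python
import json
import re

# JSON as it might arrive in a background (XHR/fetch) response body
response_body = '{"title": "Product", "price": 15.99, "tags": ["new", "sale"]}'
product = json.loads(response_body)
print(product["price"])    # 15.99
print(product["tags"][0])  # new

# JSON assigned to a JavaScript variable inside a <script> tag
# (hypothetical page source and variable name)
html = '<script>window.__DATA__ = {"title": "Product", "price": 15.99};</script>'
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.DOTALL)
hidden = json.loads(match.group(1))
print(hidden["title"])     # Product
```

The non-greedy pattern stops at the first "};" it meets, which is enough for this flat example but may cut off larger nested objects, so real pages usually need a more careful extraction step.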
Parsing JSON
Scraped JSON datasets can be huge, made up of several nested layers of objects and arrays. It's not uncommon to see JSON datasets with 1000+ keys, so JSON parsing can quickly become very complex.
There are a few popular ways to deal with large JSON datasets. To start, recursive tools like object-scan and nested-lookup can be used to easily select nested keys.
nested-lookup provides a simple way to look up nested keys:
from nested_lookup import nested_lookup

data = {
    "a": {
        "b": {
            "email": "foo@example.com"
        }
    }
}
# nested_lookup returns a list of every value stored under the key
print(nested_lookup('email', data)[0])
# will print: foo@example.com
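To make the idea of recursive selection concrete, here is a rough pure-Python sketch of a nested key lookup. It is an illustration of the general technique only, not nested-lookup's actual implementation, and the find_key name is made up for this example:

```python
def find_key(node, target):
    """Yield every value stored under `target`, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield value
            # keep descending: the value itself may hold more matches
            yield from find_key(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, target)


data = {
    "a": {"b": {"email": "foo@example.com"}},
    "c": [{"email": "bar@example.com"}],
}
emails = list(find_key(data, "email"))
print(emails)  # ['foo@example.com', 'bar@example.com']
```

Every dictionary and list is visited once, so the caller never needs to know how deep the key sits.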
object-scan provides a simple way to look up nested keys:
const objectScan = require('object-scan');

const myNestedObject = {
  level1: {
    level2: {
      level3: {
        myTargetKey: 'value',
      },
    },
  },
};
const searchTerm = 'myTargetKey';
// rtn: 'value' returns the matched values rather than their key paths (the default)
const result = objectScan([`**.${searchTerm}`], { rtn: 'value' })(myNestedObject);
console.log(result);
// will print: [ 'value' ]
For more complex JSON parsing, there are specialized query languages that do for JSON what CSS selectors and XPath do for HTML.
The two most common, jsonpath and jmespath, are available in many languages, though they differ from each other in syntax and features.
Jsonpath
Jsonpath is more similar to HTML's XPath notation, allowing powerful navigation for working with nested keys. In particular, the .. selector (equivalent to XPath's //) is great for working with unreliable JSON structures.
from jsonpath_ng import parse

data = {
    "data": {
        "info": {
            "products": [
                {"price": 1, "_type": "product", "id": "123"},
                {"price": 2, "_type": "product", "id": "345"}
            ]
        }
    }
}
# select recursively for "products" key and the "price" key of each product
jsonpath_expr = parse('$..products[*].price')
for match in jsonpath_expr.find(data):
    print(match.value)
# will print 1, then 2
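To show why recursive descent helps with unreliable structures, here is a rough pure-Python approximation of what $..products[*].price selects. The descend and product_prices helpers and the two payload layouts are hypothetical, written only for this illustration, and they are not how a jsonpath library is implemented:

```python
def descend(node, path=()):
    """Yield (path, value) for every node below `node`, depth first."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return  # scalars have no children
    for key, value in items:
        yield path + (key,), value
        yield from descend(value, path + (key,))


def product_prices(doc):
    """Rough equivalent of $..products[*].price."""
    prices = []
    for path, value in descend(doc):
        # match any "products" list, wherever it sits in the document
        if path[-1] == "products" and isinstance(value, list):
            prices.extend(
                item["price"]
                for item in value
                if isinstance(item, dict) and "price" in item
            )
    return prices


# the same products nested under two different (made-up) layouts
old_layout = {"data": {"info": {"products": [{"price": 1}, {"price": 2}]}}}
new_layout = {"payload": {"products": [{"price": 1}, {"price": 2}]}}
print(product_prices(old_layout))  # [1, 2]
print(product_prices(new_layout))  # [1, 2]
```

Because the selection ignores everything above the products key, a scraper built this way keeps working when the site moves that key to a different place in the document.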
Jmespath
Jmespath, on the other hand, doesn't support recursive selection, but it has powerful data reshaping capabilities:
import jmespath

data = {
    "data": {
        "info": {
            "products": [
                {"price": {"usd": 1}, "_type": "product", "id": "123"},
                {"price": {"usd": 2}, "_type": "product", "id": "345"}
            ]
        }
    }
}
# easily reshape nested dataset to flat structure:
result = jmespath.search("data.info.products[*].{id:id, price:price.usd}", data)
print(result)
# will print: [{'id': '123', 'price': 1}, {'id': '345', 'price': 2}]
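For comparison, the same reshaping can be written by hand in plain Python. This sketch only shows what the expression above amounts to for this particular dataset; it is not a general replacement for the query language:

```python
data = {
    "data": {
        "info": {
            "products": [
                {"price": {"usd": 1}, "_type": "product", "id": "123"},
                {"price": {"usd": 2}, "_type": "product", "id": "345"},
            ]
        }
    }
}

# walk the fixed path by hand, then keep only the id and the USD price
flat = [
    {"id": product["id"], "price": product["price"]["usd"]}
    for product in data["data"]["info"]["products"]
]
print(flat)  # [{'id': '123', 'price': 1}, {'id': '345', 'price': 2}]
```

One difference worth noting: this comprehension raises a KeyError when a key is missing, while a JMESPath expression evaluates a missing key to null, which is often more forgiving on inconsistent scraped data.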
Both of these tools are very popular in modern web scraping when it comes to JSON data parsing.