Hidden Web Data
Web pages often contain structured data that is not visible on the page but is embedded in the HTML source. Extracting this data is usually easier and faster than parsing visual elements, and the result is often more complete.
Interactive lesson
This topic is covered in the Scrapfly Academy: Hidden Web Data lesson.
Where to Find Hidden Data
1. JSON-LD in Script Tags
Many websites embed structured data using JSON-LD (JSON for Linked Data) for SEO purposes. It is often the richest source of hidden data.
```html
```
Extract it with:
```python
import json
from parsel import Selector

# `response` is assumed to be an already-fetched HTTP response
sel = Selector(response.text)
json_ld = sel.css('script[type="application/ld+json"]::text').getall()
for data in json_ld:
    parsed = json.loads(data)
    print(parsed)
```
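In practice a JSON-LD block may hold a single object, a list of objects, or an `@graph` wrapper, and some pages ship malformed blocks. The following is a minimal sketch of handling those cases. It uses only the standard library's `html.parser` in place of parsel so that it runs without dependencies; `SAMPLE_HTML`, `JsonLdCollector`, and `extract_json_ld` are hypothetical names introduced for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; real pages will differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "Product", "name": "Widget Pro", "offers": {"@type": "Offer", "price": "29.99"}},
  {"@type": "BreadcrumbList", "itemListElement": []}
]}
</script>
<script type="application/ld+json">{ not valid json }</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_json_ld(html):
    """Return a flat list of JSON-LD items, skipping blocks that fail to parse."""
    collector = JsonLdCollector()
    collector.feed(html)
    items = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it instead of aborting
        # A block may hold one object, a list of objects, or an "@graph" wrapper.
        candidates = parsed if isinstance(parsed, list) else [parsed]
        for candidate in candidates:
            if isinstance(candidate, dict) and isinstance(candidate.get("@graph"), list):
                items.extend(candidate["@graph"])
            else:
                items.append(candidate)
    return items


items = extract_json_ld(SAMPLE_HTML)
products = [i for i in items if isinstance(i, dict) and i.get("@type") == "Product"]
print(products[0]["name"], products[0]["offers"]["price"])
```

Filtering on `@type` after flattening keeps the caller independent of how a given site chooses to wrap its items.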
2. JavaScript Variables
Data is often assigned to JavaScript variables that hydrate the page:
```html
```
Extract with regex or chompjs:
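```python
import re
import json

import chompjs  # third-party: pip install chompjs

# Using regex (adjust the variable name to match the target page)
match = re.search(r'window\.INITIAL_STATE\s*=\s*({.+?});', response.text)
if match:
    data = json.loads(match.group(1))

# Using chompjs (handles JavaScript objects that aren't valid JSON)
# `script_text` is assumed to hold the text of the relevant <script> tag
data = chompjs.parse_js_object(script_text)
```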
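A non-greedy pattern such as `{.+?});` stops at the first `};` it meets, which can fall inside a string value and truncate the payload. One possible alternative, sketched below, is to locate the assignment with a regex and then let the standard library's `json.JSONDecoder.raw_decode` find where the object ends. This only works when the assigned value is strict JSON; for looser JavaScript object literals chompjs remains the better fit. The page source and the helper name `extract_js_json` are assumptions made for illustration.

```python
import json
import re

# Hypothetical page source; the variable name and payload are illustrative.
SAMPLE_HTML = """
<script>
  window.INITIAL_STATE = {"product": {"id": 42, "note": "ends with };"}, "inStock": true};
  window.OTHER = {};
</script>
"""


def extract_js_json(html, variable):
    """Decode the JSON object assigned to `variable`, or return None if absent."""
    # Match the assignment and stop right before the opening brace.
    match = re.search(re.escape(variable) + r"\s*=\s*(?=\{)", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here: the semicolon and the rest of the JS).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


state = extract_js_json(SAMPLE_HTML, "window.INITIAL_STATE")
print(state["product"]["id"], state["inStock"])
```

Because the decoder tracks string boundaries and nesting itself, the `};` inside the `note` value does not cut the object short.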
3. Meta Tags
Open Graph and Twitter Card meta tags contain page metadata:
```python
from parsel import Selector

# `response` is assumed to be an already-fetched HTTP response
sel = Selector(response.text)

title = sel.css('meta[property="og:title"]::attr(content)').get()
description = sel.css('meta[property="og:description"]::attr(content)').get()
image = sel.css('meta[property="og:image"]::attr(content)').get()
price = sel.css('meta[property="product:price:amount"]::attr(content)').get()
```
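When many meta tags are of interest, it can be simpler to collect them all into one dictionary than to write a selector per field. The sketch below does this with the standard library's `html.parser` (swapped in for parsel to stay dependency-free). It assumes Open Graph tags use the `property` attribute and Twitter Card tags use `name`, which is the common convention but not guaranteed on every site; `SAMPLE_HTML`, `PREFIXES`, and `MetaTagCollector` are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration.
SAMPLE_HTML = """
<head>
  <meta property="og:title" content="Widget Pro">
  <meta property="og:image" content="https://example.com/widget.jpg">
  <meta property="product:price:amount" content="29.99">
  <meta name="twitter:card" content="summary_large_image">
  <meta name="viewport" content="width=device-width">
</head>
"""

# Only keys with these prefixes are kept.
PREFIXES = ("og:", "twitter:", "product:")


class MetaTagCollector(HTMLParser):
    """Builds a {key: content} dict from matching <meta> tags."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph uses property="..."; Twitter Cards usually use name="...".
        key = attrs.get("property") or attrs.get("name") or ""
        if key.startswith(PREFIXES) and attrs.get("content") is not None:
            # Keep the first occurrence if a key is repeated.
            self.tags.setdefault(key, attrs["content"])


collector = MetaTagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
```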
4. Microdata and RDFa
HTML5 microdata uses itemscope, itemtype, and itemprop attributes:
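```html
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Widget Pro</span>
  <span itemprop="price">$29.99</span>
</div>
```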
```python
name = sel.css('[itemprop="name"]::text').get()
price = sel.css('[itemprop="price"]::text').get()
```
For comprehensive microdata extraction, use the extruct library, which can extract JSON-LD, microdata, RDFa, and Open Graph data in a single call.
5. Data Attributes
Developers often store data in custom data-* attributes:
```python
product_id = sel.css(".product::attr(data-product-id)").get()
price = sel.css(".product::attr(data-price)").get()
category = sel.css(".product::attr(data-category)").get()
```
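Some sites also store whole JSON payloads in a data attribute, HTML-escaped so they fit inside the quotes. The sketch below gathers every `data-*` attribute of elements with a given class and decodes values that look like JSON arrays or objects. It relies on the standard library's `html.parser`, which unescapes attribute values before handing them over; the markup, the `data-variants` attribute, and the class `DataAttrCollector` are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup; attribute names are illustrative.
SAMPLE_HTML = """
<div class="product" data-product-id="12345" data-price="29.99"
     data-category="widgets"
     data-variants="[{&quot;sku&quot;: &quot;WP-RED&quot;}, {&quot;sku&quot;: &quot;WP-BLUE&quot;}]">
  Widget Pro
</div>
"""


class DataAttrCollector(HTMLParser):
    """Collects data-* attributes of every element carrying the target class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.css_class not in (attrs.get("class") or "").split():
            return
        record = {}
        for name, value in attrs.items():
            if not name.startswith("data-"):
                continue
            key = name[len("data-"):]
            # Decode only values that look like a JSON array or object;
            # IDs and prices are left as the strings the page provided.
            if value and value.lstrip()[:1] in ("{", "["):
                try:
                    value = json.loads(value)
                except ValueError:
                    pass  # not valid JSON after all: keep the raw string
            record[key] = value
        self.records.append(record)


collector = DataAttrCollector("product")
collector.feed(SAMPLE_HTML)
print(collector.records[0])
```

Leaving scalar values as strings avoids silently turning an ID such as `00123` into something else; convert them explicitly where a number is needed.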
Why Use Hidden Data?
| Advantage | Explanation |
|---|---|
| More complete | Hidden data often contains fields not shown on the page |
| Already structured | JSON-LD and JavaScript objects are already in a parseable format |
| More reliable | Less likely to break when the page layout changes |
| Faster to parse | No complex CSS/XPath selectors needed |
| Works without JS | Available in the initial HTML, no browser needed |
Next Steps
- HTML Parsing - extracting data from visual page elements
- JSON Parsing - tools for processing extracted JSON
- Scrapfly Academy: Hidden Web Data - interactive lesson