Hidden Web Data

Web pages often contain structured data that is not visible on the page but is embedded in the HTML source. This data is easier and faster to extract than parsing visual elements, and it is often more complete.

Interactive lesson

This topic is covered in the Scrapfly Academy: Hidden Web Data lesson.

Where to Find Hidden Data

1. JSON-LD in Script Tags

Many websites embed structured data using JSON-LD (JSON for Linked Data) for SEO purposes. It is often the richest source of hidden data on a page.

Extract it with:

```python
import json
from parsel import Selector

sel = Selector(response.text)
json_ld = sel.css('script[type="application/ld+json"]::text').getall()
for data in json_ld:
    parsed = json.loads(data)
    print(parsed)
```
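In practice a JSON-LD block may hold a single object, a list of objects, or an `@graph` wrapper, and some blocks are malformed. The following is a minimal sketch of one way to normalize these cases. It uses only the standard library so it runs without parsel; `SAMPLE_HTML`, the product values, and the helper names `JsonLdCollector` and `extract_json_ld` are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source; the product values are made up for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Widget Pro", "offers": {"@type": "Offer", "price": "29.99"}}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [{"@type": "BreadcrumbList"}, {"@type": "Organization"}]}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_json_ld(html):
    """Return a flat list of JSON-LD items, skipping blocks that fail to parse."""
    collector = JsonLdCollector()
    collector.feed(html)
    items = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        # A block may hold one object, a list of objects, or an @graph wrapper.
        candidates = parsed if isinstance(parsed, list) else [parsed]
        for item in candidates:
            if isinstance(item, dict) and "@graph" in item:
                items.extend(item["@graph"])
            else:
                items.append(item)
    return items


items = extract_json_ld(SAMPLE_HTML)
products = [i for i in items if i.get("@type") == "Product"]
print(products[0]["name"], products[0]["offers"]["price"])
```

Filtering on `@type` after flattening keeps the lookup independent of how the site happens to group its JSON-LD blocks.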

2. JavaScript Variables

Data is often assigned to JavaScript variables that hydrate the page:

Extract with regex or chompjs:

```python
import json
import re

import chompjs

# Using regex (works when the assigned object is valid JSON)
match = re.search(r'window\.INITIAL_STATE\s*=\s*({.+?});', response.text)
if match:
    data = json.loads(match.group(1))

# Using chompjs (handles JavaScript objects that aren't valid JSON)
# script_text is the text of the <script> element that contains the object
data = chompjs.parse_js_object(script_text)
```
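A non-greedy pattern that stops at the first `};` can cut the capture short when the object is nested or when a string value happens to contain `};`. One possible workaround, sketched below under the assumption that the assigned value is valid JSON, is to use the regex only to locate the assignment and let `json.JSONDecoder.raw_decode` consume exactly one complete value. `SAMPLE_HTML` and the helper `extract_js_json` are hypothetical names used for illustration.

```python
import json
import re

# Hypothetical page source; note the nested object and the "};" inside a
# string, both of which would cut a non-greedy regex capture short.
SAMPLE_HTML = """
<script>
window.INITIAL_STATE = {"product": {"id": 42, "name": "Widget Pro"},
                        "note": "ends with };", "tags": ["a", "b"]};
window.OTHER = {"x": 1};
</script>
"""


def extract_js_json(html, variable):
    """Return the JSON value assigned to `variable`, or None if not found."""
    # Find where the assignment starts; the value begins right after "=".
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one complete JSON value and ignores trailing text.
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None  # not valid JSON; a JS-aware parser such as chompjs is needed
    return value


state = extract_js_json(SAMPLE_HTML, "window.INITIAL_STATE")
print(state["product"]["name"])
```

When the value uses JavaScript-only syntax (unquoted keys, single quotes, trailing commas), this helper returns `None`, and chompjs remains the better fit.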

3. Meta Tags

Open Graph and Twitter Card meta tags contain page metadata:

```python
sel = Selector(response.text)

title = sel.css('meta[property="og:title"]::attr(content)').get()
description = sel.css('meta[property="og:description"]::attr(content)').get()
image = sel.css('meta[property="og:image"]::attr(content)').get()
price = sel.css('meta[property="product:price:amount"]::attr(content)').get()
```
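When the set of fields is not known in advance, it can be simpler to collect every matching meta tag into a dictionary than to write one selector per field. The sketch below is one way to do that with the standard library alone; `SAMPLE_HTML`, its values, and the `MetaCollector` class are hypothetical. It checks both `property` (used by Open Graph) and `name` (commonly used by Twitter Cards).

```python
from html.parser import HTMLParser

# Hypothetical <head>; the values are made up for illustration.
SAMPLE_HTML = """
<head>
<meta property="og:title" content="Widget Pro">
<meta property="og:image" content="https://example.com/widget.jpg">
<meta property="product:price:amount" content="29.99">
<meta name="twitter:card" content="summary_large_image">
<meta name="viewport" content="width=device-width">
</head>
"""

PREFIXES = ("og:", "twitter:", "product:")


class MetaCollector(HTMLParser):
    """Collect Open Graph, Twitter Card, and product meta tags into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph uses `property`; Twitter Cards usually use `name`.
        key = attrs.get("property") or attrs.get("name") or ""
        if key.startswith(PREFIXES) and attrs.get("content") is not None:
            # Keep the first value if a key repeats (e.g. several og:image tags).
            self.meta.setdefault(key, attrs["content"])


collector = MetaCollector()
collector.feed(SAMPLE_HTML)
print(collector.meta)
```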

4. Microdata and RDFa

HTML5 microdata uses itemscope, itemtype, and itemprop attributes:

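```html
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Widget Pro</span>
  <span itemprop="price">$29.99</span>
</div>
```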

```python
name = sel.css('[itemprop="name"]::text').get()
price = sel.css('[itemprop="price"]::text').get()
```

For comprehensive microdata extraction, use extruct, which can extract JSON-LD, microdata, RDFa, and Open Graph in a single call.
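One detail worth keeping in mind: microdata values are not always visible text. A `<meta itemprop="...">` element carries its value in the `content` attribute, so a `::text` selector returns nothing for it. The sketch below, using only the standard library, reads `content` when present and falls back to the element text otherwise. `SAMPLE_HTML` and `ItempropCollector` are hypothetical, and the collector deliberately ignores nested `itemscope` items, which a dedicated library such as extruct handles.

```python
from html.parser import HTMLParser

# Hypothetical markup: the visible price is plain text, while the
# machine-readable price and currency live in <meta> content attributes.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Widget Pro</span>
  <span>$29.99</span>
  <meta itemprop="price" content="29.99">
  <meta itemprop="priceCurrency" content="USD">
</div>
"""


class ItempropCollector(HTMLParser):
    """Map each itemprop to its `content` attribute, else to its text."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._pending = None  # itemprop currently waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if attrs.get("content") is not None:
            self.props[prop] = attrs["content"]
        else:
            self._pending = prop

    def handle_data(self, data):
        # Assign the first non-blank text after the tag to the pending itemprop.
        if self._pending and data.strip():
            self.props[self._pending] = data.strip()
            self._pending = None


collector = ItempropCollector()
collector.feed(SAMPLE_HTML)
print(collector.props)
```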

5. Data Attributes

Developers often store data in custom data-* attributes:

```python
product_id = sel.css(".product::attr(data-product-id)").get()
price = sel.css(".product::attr(data-price)").get()
category = sel.css(".product::attr(data-category)").get()
```
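Data attributes sometimes hold JSON-encoded values in addition to plain strings. As a rough sketch, the snippet below gathers every `data-*` attribute of elements with a given class into one record and decodes JSON values where possible. It relies only on the standard library; `SAMPLE_HTML`, the `data-options` attribute, and the names `decode` and `DataAttrCollector` are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup; data-options is an invented JSON-encoded attribute.
SAMPLE_HTML = """
<div class="product featured" data-product-id="42" data-price="29.99"
     data-category="widgets" data-options='{"colors": ["red", "blue"]}'>
  Widget Pro
</div>
"""


def decode(value):
    """Decode JSON-encoded attribute values; leave plain strings untouched."""
    try:
        return json.loads(value)
    except (TypeError, json.JSONDecodeError):
        return value


class DataAttrCollector(HTMLParser):
    """Collect data-* attributes of every element carrying `css_class`."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class not in classes:
            return
        # Strip the "data-" prefix so keys read like field names.
        self.records.append({
            name[len("data-"):]: decode(value)
            for name, value in attrs.items()
            if name.startswith("data-")
        })


collector = DataAttrCollector("product")
collector.feed(SAMPLE_HTML)
print(collector.records[0])
```

Note that `decode` turns numeric strings into numbers, so `data-price` becomes a float here. If exact decimal prices matter, keeping the raw string and converting it with `decimal.Decimal` may be the safer choice.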

Why Use Hidden Data?

| Advantage | Explanation |
| --- | --- |
| More complete | Hidden data often contains fields not shown on the page |
| Already structured | JSON-LD and JavaScript objects are already in a parseable format |
| More reliable | Less likely to break when the page layout changes |
| Faster to parse | No complex CSS/XPath selectors needed |
| Works without JS | Available in the initial HTML, no browser needed |
