JavaScript
Modern websites rely increasingly on JavaScript, which web scrapers do not execute unless they use browser automation.
This leads to a common issue: the scraper sees different HTML than the user does.
HTTP-based scrapers lack the web page and browser contexts, so even if they could run JavaScript, the page would not render the same way it does in a browser.
There are two ways to approach dynamic JavaScript pages:
- Use browser automation or a web scraping API to render the page for us.
- Reverse engineer the JavaScript behavior and replicate it in our scraper.
The former is resource-intensive and can be slow. The latter requires more development time but can be significantly faster and can even simplify the scraping process.
In this section, we'll look at common ways to deal with dynamic JavaScript pages through reverse engineering.
JS Variables
If the page is not storing its data in the HTML tree, where is it?
The first possibility is JavaScript variables. A common loading sequence for a dynamic page is:
- Store a data cache in a variable in a <script> element.
- On page load, use JavaScript to expand the variable data into HTML elements.
This is called client-side rendering. For web scraping, it means that instead of parsing the rendered HTML, our scrapers can simply grab this cached data from the <script> element.
For example, below is a dynamic product rendered by JavaScript:
If we look at the HTML, we can see a JavaScript variable that is rendered into HTML on page load:
<div id="product">
<!-- placeholder element - will be filled up by JS -->
</div>
<script>
var data = {"name": "Awesome Product", "price": "85.16"};
document.querySelector('#product').innerHTML = `
<div class="product-name">${data.name}</div>
<div class="product-price">${data.price}</div>
`
</script>
So, even without the ability to run JavaScript, our scraper can easily scrape this data by selecting the <script> contents and extracting the data variable.
Scrape This?
This example scrapes dynamic JavaScript variables with Python, httpx, and parsel:
import json
import httpx
from parsel import Selector
# retrieve the static HTML page
response = httpx.get("https://webscraping.fyi/web-tech/javascript/#js-variables")
assert response.status_code == 200 # otherwise request failed
# find script element that contains "var data" in it
sel = Selector(text=response.text)
script = sel.xpath("//script[contains(., 'var data')]/text()")
# extract JSON value from the script element
data = script.re('var data = (.+?);')[0]
# load it as a Python dictionary
data = json.loads(data)
print(data)
{
'name': 'Awesome Product',
'price': '85.16'
}
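The non-greedy regular expression above stops at the first semicolon, so it would cut the value short if any string inside the data contained a `;`. A minimal sketch of a more tolerant approach, assuming the variable holds valid JSON (not a JavaScript object with unquoted keys), is to locate the assignment and let the JSON decoder find where the value ends. The helper name `extract_js_object` and the sample script text are hypothetical and only serve as illustration:

```python
import json
import re


def extract_js_object(script_text: str, variable: str = "data") -> dict:
    """Decode the JSON literal assigned in `var <variable> = ...`."""
    # the regex only locates where the value starts, it does not capture it
    match = re.search(r"var\s+" + re.escape(variable) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"variable {variable!r} not found in script")
    # raw_decode parses exactly one JSON value starting at the given index and
    # ignores whatever follows, so semicolons inside strings are harmless
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# made-up script text with a semicolon inside the product name
script_text = 'var data = {"name": "Awesome; Product", "price": "85.16"}; render(data);'
product = extract_js_object(script_text)
print(product)
```

With parsel, the `script_text` argument could come from the same `//script[contains(., 'var data')]/text()` selection used in the example above.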
Background Requests
If the data is not in the HTML and is not hidden in <script> tags, then it's likely being downloaded through background requests after the page loads.
For example, below is a dynamic product rendered by JavaScript:
If we look at the HTML, we can see a fetch() request that, upon completion, renders the retrieved data into the HTML:
<div id="product-xhr">
<!-- placeholder element - will be filled up by JS -->
</div>
<script>
document.addEventListener("DOMContentLoaded", function() {
const url = new URL(document.location);
url.pathname = "/extra/product.json";
fetch(url)
.then( response => response.json())
.then( data => {
document.querySelector('#product-xhr').innerHTML = `
<div class="product-name">${data.name}</div>
<div class="product-price">${data.price}</div>
`
});
});
</script>
These background requests are often called XHR (after XMLHttpRequest), and we can observe them using the browser's developer tools. They can then be replicated in the web scraper.
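Replicating the request starts with working out the URL the page script builds. The script above copies the page location and swaps its path for `/extra/product.json`; a rough Python equivalent using only the standard library might look like the sketch below. The helper name `background_request_url` is hypothetical, and dropping the fragment is an assumption based on the fact that fragments are never sent to the server:

```python
from urllib.parse import urlsplit, urlunsplit


def background_request_url(page_url: str, api_path: str = "/extra/product.json") -> str:
    """Mirror `url.pathname = ...` from the page script: same origin, new path."""
    parts = urlsplit(page_url)
    # keep the scheme, host and query string like the page script does;
    # drop the fragment because it is not part of the HTTP request
    return urlunsplit((parts.scheme, parts.netloc, api_path, parts.query, ""))


api_url = background_request_url("https://webscraping.fyi/web-tech/javascript/#js-variables")
print(api_url)
```

The resulting URL can be compared against the request shown in the developer tools network tab before hard-coding it in a scraper.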
When scraping background requests, special attention should be paid to session cookies and request headers like x-csrf, Origin, Referer and even Content-Type. In other words, ensure that the scraper's requests match the ones seen in the browser.
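As an illustration of matching the browser's headers, here is a minimal sketch that derives Origin and Referer from the page URL. The helper name `browser_like_headers`, the `X-CSRF-Token` header name, and the token value are assumptions for illustration; the exact names and values should be copied from the request seen in developer tools:

```python
from typing import Dict, Optional
from urllib.parse import urldefrag, urlsplit


def browser_like_headers(page_url: str, csrf_token: Optional[str] = None) -> Dict[str, str]:
    """Build headers resembling those a browser sends from `page_url`."""
    parts = urlsplit(page_url)
    headers = {
        "Accept": "application/json",
        # Origin is the scheme and host only, without path or query
        "Origin": f"{parts.scheme}://{parts.netloc}",
        # browsers send the page address without its fragment as the Referer
        "Referer": urldefrag(page_url).url,
    }
    if csrf_token is not None:
        # hypothetical header name: sites differ, check the real request
        headers["X-CSRF-Token"] = csrf_token
    return headers


headers = browser_like_headers(
    "https://webscraping.fyi/web-tech/javascript/#js-variables",
    csrf_token="token-from-page",  # placeholder value
)
print(headers)
```

The dictionary can be passed as the `headers` argument of `httpx.get`, or set on an `httpx.Client`, which also keeps session cookies between requests.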
Scrape This?
Note that the best way to scrape this is to use developer tools to find the hidden API and replicate it in Python. This example scrapes the hidden API with Python and httpx:
import httpx
response = httpx.get("https://webscraping.fyi/extra/product.json")
print(response.json())
{
'name': 'XHR Product',
'price': 55.39
}