Introduction

This blog post explores an alternative method of scraping React apps: parsing React state. Instagram is used as a case study.

Why you shouldn't scrape HTML

HTML scrapers are fragile and can break after UI changes and A/B testing. Tools like Webpack obscure CSS classnames, making DOM traversal with XPath a pain.

Large companies like Instagram thwart scrapers by hashing their CSS classnames.

How rendering works on React

Rendering in React can be thought of as a function that takes in a state and returns HTML. State management libraries like Redux allow "actions" to be dispatched so that the state being injected changes.
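As a rough illustration of that mental model, here is a minimal sketch written in Python rather than JavaScript. The names render and reducer and the action type ADD_PHOTO are hypothetical; they only mirror the idea and are not React or Redux APIs.

```python
from html import escape


def render(state):
    """Pure function: the same state always produces the same HTML."""
    items = "".join(f"<li>{escape(url)}</li>" for url in state["photos"])
    return f"<ul>{items}</ul>"


def reducer(state, action):
    """Return a new state for a dispatched action (Redux-style, hypothetical)."""
    if action["type"] == "ADD_PHOTO":
        return {**state, "photos": state["photos"] + [action["url"]]}
    return state


state = {"photos": ["a.jpg"]}
print(render(state))

# Dispatching an action yields a new state, and rendering it yields new HTML.
state = reducer(state, {"type": "ADD_PHOTO", "url": "b.jpg"})
print(render(state))
```

The point for scraping is that the HTML is derived entirely from the state, so the state holds everything the page displays.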

React and client-side rendering

By default, a React app is rendered on the client.

Dynamic data is not fetched by the first render on the client, so a spinning wheel is displayed while React sends out API calls to multiple endpoints for data.

Since HTML is rendered on the client, the HTML sent from the server is usually a nearly empty page that points to the React code.

Why is this not good? Two reasons:

1. It is not friendly to search engines that do not execute JS code.
2. The time until useful content is displayed is long, because we have to wait for React to render and then for the API calls that fetch the dynamic data.

Enter server-side rendering on React

The idea of SSR is that we do the initial rendering to HTML on the server.

We send the rendered HTML to the client, alongside React, and the state that the HTML was rendered from, known as the preload state.

Here are the three things we send when we do server-side rendering:

1. Pre-rendered HTML
2. React
3. The React preload state used to pre-render the HTML on the server. More accurately, it's the state of whatever state management library is being used.

Why do we need to send the state, if the HTML representation of that state has already been rendered and sent to the client?

The preload state needs to be sent, so that the state management library on the client (e.g. Redux) can inject that into React on the client and continue from the state that was on the server.

Injecting the entire React state from a single source is not easy to do with React alone. Thus, state management libraries like Redux and Fluxible exist and are used when server-side rendering.

Where is the preload state stored?

A common practice for server-side rendering is to assign the preload state to a property of the window object inside a <script> tag of index.html. Here is an example from an Instagram profile:

The preload state of Instagram is stored in window._sharedData.
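To make the idea concrete, here is a minimal sketch of how a server might embed a preload state in the page it sends. It is written in Python for illustration only. The function build_page, the root element id, the /bundle.js path and the window.__PRELOADED_STATE__ name are all assumptions, not Instagram's actual implementation.

```python
import json


def build_page(rendered_html, state):
    """Return an HTML page holding pre-rendered markup plus the state it came from."""
    # Escape "<" so a string value containing "</script>" cannot end the tag early.
    state_json = json.dumps(state).replace("<", "\\u003c")
    return (
        "<!DOCTYPE html><html><body>"
        f'<div id="root">{rendered_html}</div>'
        # Hypothetical global name; each site picks its own.
        f"<script>window.__PRELOADED_STATE__ = {state_json};</script>"
        '<script src="/bundle.js"></script>'
        "</body></html>"
    )


page = build_page("<ul><li>a.jpg</li></ul>", {"photos": ["a.jpg"]})
print(page)
```

Whatever the global is called, the scraper's job is the same: find that <script> tag in the response and read the data out of it.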

What does the preload state contain?

It is the state used to render what we see in the browser. Thus, it contains the fetched data we want. In the context of Instagram, it contains the list of photos we see on someone's profile page.

The preload state for Instagram stores the photos under "edge_owner_to_timeline_media" > "edges".

Each element in the "edges" array is a photo in someone's Instagram profile.

If the React app you are scraping protects its API using a CSRF token, chances are it can be found inside the preload state. The token can be extracted and used to make calls to the website's endpoints.
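As a small sketch of pulling such a token out of a page, the snippet below searches the raw HTML for a quoted key and returns its string value. The key name csrf_token, the helper find_csrf_token and the sample page fragment are assumptions for illustration; check the preload state of the site you are scraping for the real key.

```python
import re

# Hypothetical page fragment; real key names and layout vary by site.
html_str = '<script>window._sharedData = {"config":{"csrf_token":"abc123"}};</script>'


def find_csrf_token(html_str, key="csrf_token"):
    """Return the string value stored under `key`, or None if it is absent."""
    pattern = r'"%s"\s*:\s*"([^"\\]*)"' % re.escape(key)
    match = re.search(pattern, html_str)
    return match.group(1) if match else None


token = find_csrf_token(html_str)
print(token)
```

How the token has to be sent back (which header or form field) depends on the site, so inspect the requests the page itself makes before reusing it.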

Parsing the preload state

Since the preload state contains what we need, we can parse it into a data structure we can use in our language of choice.

Most languages come with a helper function to parse JSON into a hashmap/dictionary. Python has json.loads(). However, it is not as straightforward as finding the preload state and parsing the entire object as JSON. The preload state is a JavaScript object literal, whose syntax can be thought of as a superset of JSON. JSON has certain restrictions, like no functions as values, no trailing commas, and compulsory double quotes around object keys. The entire preload state in its string representation has a decent chance of not being valid JSON.
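A quick way to see these restrictions is to feed a few JavaScript-style object literals to json.loads(). The sample strings below are made up for illustration; only the first one is valid JSON.

```python
import json

samples = {
    "valid JSON": '{"id": 1, "tags": ["a", "b"]}',
    "unquoted key": "{id: 1}",
    "trailing comma": '{"id": 1,}',
    "single quotes": "{'id': 1}",
}

results = {}
for label, text in samples.items():
    try:
        json.loads(text)
        results[label] = True
    except json.JSONDecodeError:
        # Valid as a JavaScript object literal, but not valid JSON.
        results[label] = False
    print(label, "->", "parses" if results[label] else "rejected")
```

All three rejected strings are valid JavaScript object literals, which is why the whole preload state cannot always be handed to a JSON parser as-is.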

There are two ways we can parse the preload state:

1. Parse it as JavaScript code into an AST using libraries like Espree, Acorn and Esprima. Check out https://astexplorer.net/ to see what an AST of JS code looks like.
2. Slice out the valid JSON parts and parse them as JSON into a dictionary.

I went with option 2 as no dependencies are needed.

Slicing out the valid JSON parts

Here's the Python code needed to find the Instagram photos in the preload state:

    import json

    # html_str is assumed to hold the HTML of the profile page as a string.
    key = '"edge_owner_to_timeline_media":{'
    start_index = html_str.find(key)
    # The key string ends with the opening brace of the value we want.
    first_brace_index = start_index + len(key) - 1
    last_brace_index = find_last_brace_index(first_brace_index, html_str)
    edge_owner_to_timeline_media_str = html_str[first_brace_index:last_brace_index + 1]
    edge_owner_to_timeline_media_dict = json.loads(edge_owner_to_timeline_media_str)

First, we find the index of the opening brace of the value stored under the key we want. Then, we find the matching closing brace. Lastly, we slice out everything from the opening brace to the closing brace, then parse it as JSON into a Python dictionary.

Here is the algorithm used to find the index of the matching closing brace:

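    def find_last_brace_index(first_brace_index, html_str):
        """Return the index of the brace that closes the one at first_brace_index."""
        if html_str[first_brace_index] != "{":
            raise ValueError("first_brace_index is not an opening brace ({).")
        # The stack holds every unclosed "{", plus a '"' while we are inside a string.
        stack = ["{"]
        i = first_brace_index + 1
        while len(stack) > 0:
            if i >= len(html_str):
                raise ValueError("No matching closing brace found.")
            char = html_str[i]
            if char == '"':
                if stack[-1] == '"':
                    stack.pop()
                else:
                    stack.append('"')
            # Inside a string, skip whatever character the backslash escapes,
            # so that \" does not end the string and \\ is consumed as a pair.
            elif char == '\\' and stack[-1] == '"':
                i += 1
            elif char == '{':
                if stack[-1] != '"':
                    stack.append('{')
            elif char == '}':
                if stack[-1] == '{':
                    stack.pop()
            i += 1
        return i - 1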

This problem is a variant of the Valid Parentheses problem on LeetCode. We use a stack because its LIFO (last in, first out) interface lets us match each closing brace against the most recent unclosed opening brace. This algorithm ignores braces inside strings.
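To see the brace matching at work on something small, here is a self-contained sketch run against a made-up page fragment. It uses a compact variant of the same idea (a depth counter and an in-string flag instead of an explicit stack), named find_matching_brace to keep it distinct from the function above. The sample HTML, caption and URL are hypothetical.

```python
import json


def find_matching_brace(text, open_index):
    """Return the index of the brace that closes the one at open_index."""
    if text[open_index] != "{":
        raise ValueError("open_index does not point at an opening brace")
    depth = 0
    in_string = False
    i = open_index
    while i < len(text):
        char = text[i]
        if in_string:
            if char == "\\":
                i += 1  # skip the escaped character, whatever it is
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("no matching closing brace found")


# Hypothetical fragment; the caption contains braces and escaped quotes.
html_str = (
    '<script>window._sharedData = {"edge_owner_to_timeline_media":'
    '{"edges":[{"node":{"caption":"smile :} \\"hi\\" {",'
    '"display_url":"https://example.com/1.jpg"}}]},'
    '"other":{}};</script>'
)

key = '"edge_owner_to_timeline_media":{'
start = html_str.find(key) + len(key) - 1
end = find_matching_brace(html_str, start)
media = json.loads(html_str[start:end + 1])
print(media["edges"][0]["node"]["display_url"])
```

The braces inside the caption string do not affect the depth count, so the slice ends at the brace that really closes the "edge_owner_to_timeline_media" object.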

Now that we have the state as a Python dictionary, we can access all the photo URLs:

    photos = edge_owner_to_timeline_media_dict["edges"]
    for photo in photos:
        photo_url = photo["node"]["display_url"]
        print(photo_url)

Output:

    https://instagram.fsin6-1.fna.fbcdn.net/vp/21f9ea0e4e43f0f1fa41e658411791d3/5D38148A/t51.2885-15/e35/43915186_898297630358820_7730154586849148928_n.jpg?_nc_ht=instagram.fsin6-1.fna.fbcdn.net
    https://instagram.fsin6-1.fna.fbcdn.net/vp/0cd31937e259dc7ab589577a7908c688/5D4BA1F6/t51.2885-15/e35/43403357_762059030826467_4180930829049921536_n.jpg?_nc_ht=instagram.fsin6-1.fna.fbcdn.net
    https://instagram.fsin6-1.fna.fbcdn.net/vp/517476654891c2360acf3889981c0980/5CAC8A84/t51.2885-15/e15/43578334_240739026621935_2046743716898537472_n.jpg?_nc_ht=instagram.fsin6-1.fna.fbcdn.net
    ...
    ...
    ...

Source code

The Instagram code snippets were adapted from carousell-telebot, a Telegram bot I wrote that allowed you to listen by keyword for new product listings on Carousell.

Summary

In this blog post, I went through how you can scrape React apps that are rendered on the server. Try scraping some React apps! You would be surprised how simple it is to scrape them if you can get your hands on the preload state.

Websites to scrape: