Since a good number of our customers use our serverless platform to more easily deploy and scale their web bots and scrapers, I thought I’d write a post about a fun scraping challenge I encountered. Solving it required thinking a little bit outside the box. I thought I’d share it here since it demonstrates a fairly re-usable approach to scraping heavily-obfuscated sites. This post will dive into how you can use request interception in Puppeteer to beat heavily obfuscated sites that are built to be resistant to scraping.

NOTE: This post is meant to be about the technical content and not about actually scraping the given site. For that reason the site is redacted throughout this post. We don't advocate scraping any individual site; you should always use your best judgment when engaging in any scraping project.

Background: The Problem

Note: Feel free to skip this section if you don’t care about the background for why the scraping was necessary.

Recently, I was working on a project to calculate the most optimal route for spinning Pokestops in the mobile game Pokemon Go (a topic for another post). For those unfamiliar with Pokemon Go, you can obtain in-game items by physically visiting “Pokestop” locations in the real world and “spinning” them in the app.

Clip from https://www.youtube.com/watch?v=6GTF-tc7mjY

Since the entire game hinges on having enough resources (Pokeballs, eggs, berries, etc.), spinning as many Pokestops as possible is a shared goal for basically all players.



Pokestops are designed to encourage players to constantly be on the move instead of standing around. For example, once you spin a Pokestop you can’t spin it again for five minutes, so you always want to be heading to the next-nearest stop. Even if you wait around to re-spin the same stop, you are only awarded one fifth of the experience points for doing so (50 XP instead of 250 XP for a new stop). Suffice it to say, to get the most bang for your buck you should be minimizing the time spent walking between Pokestops and maximizing the number of unique Pokestops spun.



Given that routing problems are something computers have gotten pretty good at, I figured that if I had the GPS coordinates of all the Pokestops near me, I could calculate the best route that hits all of them in the shortest distance possible. However, this is easier said than done, as Niantic (the creators of the game) is extremely aggressive about banning scrapers from the game.



Risking a ban was not something I was interested in doing. Luckily, there are community efforts to build out an effective map of all Pokestops in the game. PokestopMap (the codeword we'll use to reference the site) is one such website, which allows players to mark and confirm Pokestops that they see in the app. This community-built map offers a great data source for where to find Pokestops:



Map detailing all of the Pokestops and Gyms in San Francisco

In order to get the data for the routing portion of the project, we need to scrape this website for the GPS coordinates of all of these stops. How hard could that be?

Obfuscation, Obfuscation Everywhere!

The first thing I looked at was simply automating the HTTP requests the webpage makes to the backend API. Looking at the XMLHttpRequests made by the page, the following request/response seemed to be what I was looking for:

<p>CODE: https://gist.github.com/mandatoryprogrammer/c7248325318c15a140d14b881643029b.js</p>



This seems easy enough, but taking a closer look at the response JSON we notice something troubling:



<p>CODE: https://gist.github.com/mandatoryprogrammer/96a3e7df38ef4985feb28086c848c615.js</p>



The fields which should contain the GPS coordinates appear to be obfuscated. The keys are seemingly random strings and the values are base64-encoded floats which do not match up with the coordinates of the points on the map. The “realrand” field also gives us a hint that this response data is intentionally obfuscated to prevent scraping. Bummer.
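For illustration, here is how a base64-encoded IEEE-754 double could be decoded in Node.js. This is just the generic case; the site's actual encoding scheme is unknown, and as noted above, even the decoded values don't line up with real map coordinates without whatever extra de-obfuscation step the app applies:

```javascript
// Hypothetical illustration: decoding a base64 string as a little-endian
// IEEE-754 double in Node.js. The site's real encoding scheme is unknown;
// this only shows the general shape of such base64-encoded float values.
function decodeBase64Double(b64) {
  const buf = Buffer.from(b64, 'base64');
  return buf.readDoubleLE(0);
}

// Round-trip a sample value (San Francisco's latitude) for demonstration:
const sample = Buffer.alloc(8);
sample.writeDoubleLE(37.7749, 0);
const encoded = sample.toString('base64');
const decoded = decodeBase64Double(encoded);
```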



The rabbit hole is for the rabbits, let’s keep it that way.

Often when encountering obfuscation, the first instinct is to go down the rabbit hole of reverse-engineering the app’s client-side JavaScript. In my experience, this should be an absolute last resort, for a couple of reasons:

It’s often trivial for a website to change the JavaScript obfuscation, and it’s time-intensive for us to reverse engineer it again (five minutes vs potentially hours). The battle is not in your favor.

In addition to being easy to change, the obfuscation doesn’t actually aid the app’s function in any way. As with all scraping, any point of reference you use to scrape a webpage should be as unlikely to change as possible, because every change to your points of reference will break your bot (which you’ll then have to fix!). This is true whether the point of reference is a CSS selector or an obfuscated JavaScript file. Scraping via points of reference that aid the app’s function makes your hard work less likely to be broken by site changes.

It requires a lot of experience with JavaScript and reverse engineering to accomplish. There are plenty of anti-debugging tricks which can make the whole process very painful for people looking to try.



That’s far too much work for us, let’s work smart instead of hard (OK reverse engineering requires plenty of smarts, but regardless…). We’ll do this by tackling the problem at a different layer: the browser.

Full Browser Scraping with Puppeteer vs Raw HTTP Requests

When working on a scraping project you often have to make the choice of going the route of doing raw HTTP requests or using a full web browser to do the scraping.



The pros of going with raw HTTP requests are generally:

Speed of scraping: Generally scraping via raw HTTP requests is much faster because you can write a script that requests only the endpoints necessary to scrape the data you need. You can skip all of the unnecessary steps (loading webpage resources, rendering DOM elements, etc).

Stability: Put simply, doing raw HTTP requests and reading the responses is often much more stable than using a full web browser. A full web browser loads the page, loads all of the page’s resources, evaluates JavaScript, renders DOM elements, and much more. This all introduces variability which can break your bot, whether through some DOM and AJAX timing issue or from a bad ad tanking your browser.

Easier to Scale: Spinning up a bunch of concurrent HTTP requests is usually much easier than spinning up a bunch of concurrent web browsers. However, you can spin up thousands of browsers in Refinery by returning an array, so this point is usually moot for me.



The pros of going with headless browser scraping are generally:

Less time spent reverse engineering the app: Since you’re using a web browser to scrape the site, you’re using the page the way a legitimate user does. This means you don’t have to reverse engineer the app’s JavaScript to figure out how to set up HTTP requests to scrape the data you want. You can instead just scrape the data from the DOM, extract it from the page’s variables, or call existing in-app functions.

Stealth: A regular browser loading all of the resources of the page is often much more stealthy than doing raw HTTP requests. When doing raw HTTP requests, are you being careful to ensure your headers are ordered like a browser’s? Are you making sure to request the same resources as the browser? Are your cookies handled correctly? Doing any of these things differently than a browser can lead to detection, and there are endless tricks to fingerprint your HTTP client*.

Visual debugging: Using a full web browser is often easier simply because you get to see the full picture of what’s happening in the browser much more clearly. You also get Chrome’s wide variety of built-in developer tools, which are extremely helpful for scraping and debugging.



*To be fair, fingerprinting a browser to figure out if it’s a bot is also entirely possible (there are even more data points to fingerprint if it’s a full browser vs a low-level HTTP client). However, this is less common and ultimately emulating the “real case” is almost always going to be the most stealthy method possible.



Generally speaking, the more complicated/obfuscated the web app, the quicker I’ll just reach for utilizing a full web browser.

“What would be the most painful to change?”: Choosing a Stable Scraping Technique

Doing a quick assessment of the site, I identified a couple of potential routes for browser-level scraping:



Calling existing in-app functions to extract the coordinate data: Painful. Multiple layers of JavaScript obfuscation are used.

Doing raw HTTP requests: Painful. The API requests and responses are obfuscated, as discussed earlier in the post. Additionally, detection and blocking were implemented for HTTP clients not doing proper request formatting.

Extracting data from webpage JavaScript variables: Painful. From a quick examination, the JavaScript variables were also obfuscated. Finding and accessing them would be a long route to take, not to mention that JavaScript variables are often simply never scoped globally (to prevent memory leaks).

Extracting data from the webpage DOM: Painful. The app appears to take pretty careful measures to ensure the GPS coordinates of the points are never inserted into the DOM.


All things considered, the developers of PokestopMap did a really good job making their site hard to scrape. All of the low-hanging fruit seems to be well trimmed; looks like we’ll have to get creative!



Taking a closer look at the page HTML, I noticed that while their main web application JavaScript is heavily obfuscated, the JavaScript mapping library they use, Leaflet.js, was not (although it was minified):



<p>CODE: https://gist.github.com/mandatoryprogrammer/0615880dada2dc682f6b6d0c0dbaffc8.js</p>







Leaflet.js is an open-source JavaScript library for making interactive maps that are mobile-friendly. It provides an easy API for adding map markers, and other map-related functionality.



PokestopMap clearly uses Leaflet to draw its interactive Pokestop map. This is good, since it means the data we’re trying to extract must be handed over to Leaflet in un-obfuscated form at some point. A quick search through the API documentation finds the function for adding markers:







Perfect, so if we hook the L.Marker function we’ll be able to intercept all of the metadata used to create the map markers. This will have the GPS coordinates we need to scrape (and likely extra metadata as well).

Popping open the developer tools and doing some quick searches for “marker” confirms this. After setting a breakpoint on the minified function, we see the data we’re looking for:





Great! But it’s one thing to see the data in a breakpoint in the Chrome developer console and another to automate scraping it. How can we use this to automate scraping all of the coordinates for our area?



Hot-Swapping Webpage JavaScript Libraries/Assets with Request Interception in Puppeteer

For those unfamiliar with Puppeteer, it’s an awesome JavaScript library for controlling and automating the Chrome web browser. Every time I read through the Puppeteer API docs, it seems they’ve added some new API or way to control the browser. When it comes to full browser scraping, it’s hard to recommend any other library for the job.



One of Puppeteer’s underrated APIs is page.setRequestInterception. It allows you to intercept HTTP requests made by the browser and modify the request and the response data. This is something you can’t do even in Chrome extensions, since the Chrome extension API for working with web requests is brutally limited.



By using this API we can replace the Leaflet.js library with a version that is modified to do a little extra. Of course, by “extra” I mean record all of the data passed to the marker-creation (L.Marker) calls. To avoid having to compile the Leaflet.js project, I just downloaded the existing minified leaflet.js file and changed the initialize call we breakpointed earlier to the following:



<p>CODE: https://gist.github.com/mandatoryprogrammer/98bed159798fe56fa39189d99e78be5d.js</p>
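The shape of that modification is roughly the following sketch. The `hookInitialize` helper and `FakeMarker` stub here are illustrative stand-ins (not Leaflet's or the site's real code): we wrap a prototype method so its arguments are recorded into a sink before the original implementation runs.

```javascript
// Minimal sketch of the hooking idea: wrap a class's initialize method
// so every call's arguments are captured before the original code runs.
// hookInitialize and FakeMarker are hypothetical names for illustration.
function hookInitialize(klass, sink) {
  const original = klass.prototype.initialize;
  klass.prototype.initialize = function () {
    // Record the arguments (for L.Marker, this includes the latlng).
    sink.push(Array.prototype.slice.call(arguments));
    return original.apply(this, arguments);
  };
}

// Stand-in for Leaflet's L.Marker class:
function FakeMarker() {}
FakeMarker.prototype.initialize = function (latlng) { this.latlng = latlng; };

const dumpedPoints = [];
hookInitialize(FakeMarker, dumpedPoints);

const marker = new FakeMarker();
marker.initialize({ lat: 37.7749, lng: -122.4194 });
```

Because the wrapped method still calls the original, the map keeps working exactly as before; we're just siphoning off a copy of the data as it flows through.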

Now using the page.setRequestInterception API we can swap the JavaScript library when Chrome loads the webpage. The following code snippet demonstrates this:



<p>CODE: https://gist.github.com/mandatoryprogrammer/09355d32f29da8ca251429cd98dacc51.js</p>
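For readers who can't view the gist, the interception logic looks roughly like this sketch. It assumes Puppeteer is installed and an instrumented copy of leaflet.js is available; the `isLeafletRequest` and `scrapeWithSwappedLeaflet` names (and the target URL) are placeholders of mine, not the project's actual code:

```javascript
// Sketch: swap a page's leaflet.js for our instrumented copy using
// Puppeteer request interception. The puppeteer module, the modified
// library source, and the target URL are passed in by the caller.
function isLeafletRequest(url) {
  // Match the asset by filename, ignoring any query string.
  return url.split('?')[0].endsWith('leaflet.js');
}

async function scrapeWithSwappedLeaflet(puppeteer, modifiedLeafletSource, targetUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);
  page.on('request', (interceptedRequest) => {
    if (isLeafletRequest(interceptedRequest.url())) {
      // Serve our instrumented copy in place of the site's original.
      interceptedRequest.respond({
        status: 200,
        contentType: 'application/javascript',
        body: modifiedLeafletSource,
      });
    } else {
      interceptedRequest.continue();
    }
  });

  await page.goto(targetUrl, { waitUntil: 'networkidle0' });
  // The instrumented library collects marker data into window.dumpedPoints.
  const points = await page.evaluate(() => window.dumpedPoints);
  await browser.close();
  return points;
}
```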



Now when our Chrome browser loads the website the Leaflet.js script will be replaced with our modified version which will dump the GPS coordinates and other metadata into the global variable “window.dumpedPoints”. We can then just do a page.evaluate() to get the results:



<p>CODE: https://gist.github.com/mandatoryprogrammer/16fe7c0c634edcc0e31e42db91724682.js</p>



We’ve now scraped all of the data we’re looking for. All that’s required to get the GPS points for Pokestops in our area is to launch our script with some seed GPS coordinates as parameters and we’ll receive an array of GPS points of all nearby Pokestops.



For those interested in seeing the final code for this, check out this Refinery project.

Bonus Technique: Speed Up Browser Scraping By Stubbing Out Assets

In addition to being useful for scraping, the page.setRequestInterception API is also useful for speeding up browser-level scraping. Using this API you can intercept requests for assets and third-party resources and skip loading them. You can do this by either calling interceptedRequest.abort() to just return a network error, or by returning a response containing the actual data for the asset.

The following code snippet demonstrates using the API to skip loading all PNG and JPG images for a given page:

<p>CODE: https://gist.github.com/mandatoryprogrammer/3d0aa7b26f379ed4b424ecd9f274e820.js</p>

Since most of a page's load time is generally spent pulling external assets, this can be a great way to speed up and increase the stability of your browser-level scraping. You can use this to implement an effective "whitelist" of the requests necessary for your scraping and make all non-necessary requests fail instantly.
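A minimal sketch of this whitelisting idea follows. The `shouldSkip` and `setupAssetSkipping` names are my own for illustration; the interception calls themselves (`setRequestInterception`, `abort`, `continue`) are Puppeteer's real API:

```javascript
// Sketch: abort image requests so they fail instantly instead of being
// fetched, speeding up page loads during browser-level scraping.
function shouldSkip(url) {
  const path = url.split('?')[0].toLowerCase();
  return path.endsWith('.png') || path.endsWith('.jpg') || path.endsWith('.jpeg');
}

// Wires the check into Puppeteer's request interception for a given page.
async function setupAssetSkipping(page) {
  await page.setRequestInterception(true);
  page.on('request', (interceptedRequest) => {
    if (shouldSkip(interceptedRequest.url())) {
      interceptedRequest.abort(); // Returns a network error immediately.
    } else {
      interceptedRequest.continue();
    }
  });
}
```

Extending `shouldSkip` to cover fonts, analytics scripts, and ad domains is often where the biggest speedups come from.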

----

By Matthew Bryant (@IAmMandatory)

CEO @ Refinery.io

