For a couple of reasons, we decided to record the traffic on the client side rather than on the server. While researching ways to capture network traffic on the client side, we started looking into Service Workers. Reading the description of Service Workers, it felt like a perfect match for our needs:

Service workers essentially act as proxy servers that sit between web applications, the browser, and the network (when available). They are intended, among other things, to… …intercept network requests …

We decided to use Service Workers and thought of the following architecture:

Loadmill service-worker architecture

This way, all the customer needs to do is add a script tag to their HTML file, and we’ll do the rest. Easy, right? Well… not so easy.

First challenge — Domain restriction

As shown in the last diagram, we want our customers to load (register, in Service Worker jargon) our Service Worker on their website. However, a Service Worker’s script must be served from the same origin as the page that registers it, so you can’t load the Service Worker’s JS from another domain.

Furthermore, the Service Worker only receives ‘fetch’ events from pages inside its scope, which by default is the path the Service Worker script was downloaded from. For example, if you registered your Service Worker from www.customer-domain/scopeA, it will intercept the requests made by pages under www.customer-domain/scopeA/*. Requests from pages under www.customer-domain/scopeB/* won’t be intercepted!
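
To make the scope rule concrete, here is a rough sketch of the check a browser performs, followed by what a registration call might look like. The helper name isControlledBy, the customer-domain.example host, and the /scopeA/service-worker.js path are all made up for illustration.

```javascript
// Sketch of how a service worker's scope decides which pages it controls.
// The URLs below are placeholders, not real customer domains.

// Returns true when pageUrl falls inside the scope. Browsers match scopes
// with a prefix comparison on the normalized URL, which this imitates.
function isControlledBy(scopeUrl, pageUrl) {
  const scope = new URL(scopeUrl).href;
  const page = new URL(pageUrl).href;
  return page.startsWith(scope);
}

const scope = "https://www.customer-domain.example/scopeA/";
console.log(isControlledBy(scope, "https://www.customer-domain.example/scopeA/cart")); // true
console.log(isControlledBy(scope, "https://www.customer-domain.example/scopeB/cart")); // false

// In a browser, registration would look roughly like this. The script path is
// hypothetical and has to live on the same origin as the page.
if (typeof navigator !== "undefined" && "serviceWorker" in navigator) {
  navigator.serviceWorker
    .register("/scopeA/service-worker.js", { scope: "/scopeA/" })
    .catch((err) => console.error("Service worker registration failed:", err));
}
```

Because the match is a plain prefix comparison, the trailing slash matters: a scope of /scopeA would also cover /scopeAB.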

If we had to host the Service Worker on the customer’s domain, the customer would have to deploy a new version during our integration and again for every update to the Service Worker’s script file, and that’s not very elegant. So, somehow, we have to maintain a www.customer-domain/service-worker route without hosting it on the customer’s server. How can you do that?

CDN to the rescue

Let’s think about which component sits in the middle between the client and the server… the CDN! We will intercept requests in the customer’s CDN, and if a request is sent for /service-worker, we will simply return the Service Worker script from our servers!
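
As a rough illustration of the idea (not our production worker), an edge handler along these lines would do the job. The function name handleRequest and the choice of upstream URL are assumptions made for this sketch; the fetch implementation is injectable only so the logic can be exercised outside a CDN.

```javascript
// Sketch of an edge worker that answers /service-worker with a script hosted
// elsewhere and passes every other request through untouched.

// Assumed upstream location of the worker script.
const WORKER_SOURCE_URL = "https://echo.loadmill.com/lm-worker-script";

async function handleRequest(request, fetchImpl = fetch) {
  const url = new URL(request.url);

  if (url.pathname === "/service-worker") {
    // Fetch the script from the vendor and re-serve it from the customer's origin.
    const upstream = await fetchImpl(WORKER_SOURCE_URL);
    return new Response(upstream.body, {
      status: upstream.status,
      headers: { "Content-Type": "application/javascript" },
    });
  }

  // Anything else goes to the origin server as usual.
  return fetchImpl(request);
}

// In an edge runtime that exposes a global "fetch" event, wire it up like this.
if (typeof addEventListener === "function") {
  addEventListener("fetch", (event) => {
    event.respondWith(handleRequest(event.request));
  });
}
```

The browser sees the script coming from the customer’s own domain, so the same-origin requirement is satisfied without the customer hosting anything.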

For this purpose, we created an app in Cloudflare’s app store that adds a Cloudflare worker (which works much like a service worker) to our customer’s app and does exactly that. Now our customers can add our Service Worker in just one click.

Registering a service worker from a cross-origin URL

In case our customer doesn’t have a CDN (get one…), we ask for just one route that returns this single line: importScripts("https://echo.loadmill.com/lm-worker-script"), which downloads our worker.

Adding our custom route to your application takes a few more minutes, but it’s still a pretty simple way to integrate our service worker.
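
For a Node.js backend, such a route could look something like the sketch below. The /service-worker path comes from the discussion above; the function names and the port are placeholders, and any web framework would work just as well.

```javascript
const http = require("http");

// The single line the route has to return.
const WORKER_STUB = 'importScripts("https://echo.loadmill.com/lm-worker-script");';

// Handles GET /service-worker and reports whether it answered the request.
function serviceWorkerRoute(req, res) {
  if (req.method === "GET" && req.url === "/service-worker") {
    // Browsers refuse to register a worker unless it has a JavaScript MIME type.
    res.writeHead(200, { "Content-Type": "application/javascript" });
    res.end(WORKER_STUB);
    return true;
  }
  return false;
}

// Example wiring; the server is created here but not started.
function createServer() {
  return http.createServer((req, res) => {
    if (!serviceWorkerRoute(req, res)) {
      res.writeHead(404);
      res.end();
    }
  });
}

// createServer().listen(8080); // hypothetical port
```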

The second challenge — AWS & XHR

We chose AWS Kinesis Firehose as the recorder endpoint to which we send all user transactions (transaction = request + response). The easiest way to work with AWS should be the AWS SDK, right? Well… that’s about as true as saying the AWS console has good UX.

Our main problem was that the AWS SDK uses the old XMLHttpRequest web API, which is not available in the service worker scope (it was deemed deprecated by the service worker spec team). We had to write some shim code that replaces the XHR object with fetch under the hood. AWS will switch to the newer fetch API in the next major version (v3), which is now in the developer preview stage, but it is unclear when exactly it is going to be released.
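
To give a feel for what such a shim involves, here is a heavily simplified sketch. It is not the code we shipped, and it makes no claim about which XHR members the AWS SDK actually touches; the class name FetchXHR and the fetchImpl hook are invented for this example.

```javascript
// Minimal XMLHttpRequest stand-in backed by fetch. It covers only a small
// subset of the real API: open, setRequestHeader, send, onload, onerror,
// onreadystatechange, status, responseText and getAllResponseHeaders.
class FetchXHR {
  constructor() {
    this.readyState = 0;
    this.status = 0;
    this.responseText = "";
    this.onload = null;
    this.onerror = null;
    this.onreadystatechange = null;
    this._requestHeaders = {};
    this._responseHeaders = "";
  }

  open(method, url) {
    this._method = method;
    this._url = url;
    this._setState(1);
  }

  setRequestHeader(name, value) {
    this._requestHeaders[name] = value;
  }

  getAllResponseHeaders() {
    return this._responseHeaders;
  }

  // Unlike the real send(), this returns a promise, which is handy for tests.
  async send(body) {
    const fetchImpl = FetchXHR.fetchImpl || fetch;
    try {
      const response = await fetchImpl(this._url, {
        method: this._method,
        headers: this._requestHeaders,
        body,
      });
      this.status = response.status;
      this._responseHeaders = [...response.headers]
        .map(([name, value]) => `${name}: ${value}`)
        .join("\r\n");
      this.responseText = await response.text();
    } catch (err) {
      this._setState(4);
      if (this.onerror) this.onerror(err);
      return;
    }
    this._setState(4);
    if (this.onload) this.onload();
  }

  _setState(state) {
    this.readyState = state;
    if (this.onreadystatechange) this.onreadystatechange();
  }
}

// Optional override so the shim can be exercised without a network.
FetchXHR.fetchImpl = null;

// Install the shim only where XMLHttpRequest is missing, as in a service worker.
if (typeof XMLHttpRequest === "undefined") {
  globalThis.XMLHttpRequest = FetchXHR;
}
```

A real shim needs more than this (timeouts, abort, binary responses, progress events), which is where most of the effort goes.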

The third challenge — Test generation

Correlations between requests

In order to replay what took place in production, we need to detect the relationships between the requests the user has made. We have to do so because, in most cases, you can’t just replay what happened in production.

Let’s consider the following scenario:

A typical scenario in production

The user made a POST request, got a new item ID in the response, and then made a GET request using that ID. Simple, right? Two requests: a POST to /items and a GET to /items/123. Let’s replay it in the staging environment, then.

Replaying production scenario as is in stage

This simple example won’t work. You can’t just blindly request /items/123, since there is no item with ID 123 in the staging environment. What you need to do is understand the relationship between the two requests and extract it into a parameter that is evaluated dynamically from the actual response value and then used by the next request. That relationship is called a correlation.

Detecting a correlation and extracting it to a parameter
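
A toy version of this detection step might look like the sketch below. It is only an illustration of the idea, not our actual algorithm: it matches URL path segments against values seen in earlier JSON responses, and the transaction shape, the function names, and the ${param1} placeholder syntax are all assumptions.

```javascript
// Walks a parsed JSON body and yields [jsonPath, value] for every primitive leaf.
function* leaves(value, path = "$") {
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      const childPath = Array.isArray(value) ? `${path}[${key}]` : `${path}.${key}`;
      yield* leaves(child, childPath);
    }
  } else if (value !== null && value !== undefined && value !== "") {
    yield [path, String(value)];
  }
}

// Turns recorded transactions into replayable steps. Any path segment that
// equals a value from an earlier response becomes a named parameter, and the
// step that produced the value is told where to extract it from.
function detectCorrelations(transactions) {
  const seen = new Map(); // response value -> { step, jsonPath, name }
  const steps = transactions.map((tx) => ({
    method: tx.request.method,
    path: tx.request.path,
    extract: {},
  }));
  let counter = 0;

  transactions.forEach((tx, index) => {
    steps[index].path = tx.request.path
      .split("/")
      .map((segment) => {
        const source = seen.get(segment);
        if (!source) return segment;
        if (!source.name) {
          source.name = `param${++counter}`;
          steps[source.step].extract[source.name] = source.jsonPath;
        }
        return "${" + source.name + "}";
      })
      .join("/");

    // Remember this response's values for the requests that follow.
    for (const [jsonPath, value] of leaves(tx.response.body)) {
      if (!seen.has(value)) seen.set(value, { step: index, jsonPath });
    }
  });

  return steps;
}

const recorded = [
  { request: { method: "POST", path: "/items" }, response: { body: { id: 123, name: "book" } } },
  { request: { method: "GET", path: "/items/123" }, response: { body: { id: 123, name: "book" } } },
];

console.log(JSON.stringify(detectCorrelations(recorded)));
// [{"method":"POST","path":"/items","extract":{"param1":"$.id"}},
//  {"method":"GET","path":"/items/${param1}","extract":{}}]
```

Exact matching like this produces false positives on short or common values, so a real implementation needs extra heuristics on top.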

User data obfuscation

Obviously, it’s a bad idea to store private user data (for reasons of security, GDPR, and good manners). We hash all of it in an irreversible way that still lets us keep track of the correlations between requests. You can read more about it here.
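
The property that matters is that equal inputs always produce equal outputs, so a hashed ID in a response still matches the hashed ID in a later request. One way to get that is a keyed hash, sketched below with Node’s crypto module; the use of HMAC-SHA256, the secret, and the function names are assumptions for this example rather than a description of our scheme.

```javascript
const crypto = require("crypto");

// Deterministic, one-way replacement for a single value. The same value and
// secret always give the same token, which keeps correlations intact.
function obfuscate(value, secret) {
  return crypto.createHmac("sha256", secret).update(String(value)).digest("hex");
}

// Applies obfuscate() to every leaf of a parsed JSON body, keeping the shape
// and the keys so the structure of the transaction stays recognizable.
function obfuscateBody(body, secret) {
  if (body === null) return null;
  if (Array.isArray(body)) return body.map((item) => obfuscateBody(item, secret));
  if (typeof body === "object") {
    return Object.fromEntries(
      Object.entries(body).map(([key, value]) => [key, obfuscateBody(value, secret)])
    );
  }
  return obfuscate(body, secret);
}

const secret = "hypothetical-secret";
const response = obfuscateBody({ id: 123, owner: { email: "jane@example.com" } }, secret);
const nextRequestId = obfuscate("123", secret);

// The ID from the response and the ID in the next request still line up.
console.log(response.id === nextRequestId); // true
```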

Conclusion (my 2¢)

I won’t lie to you; writing a Service Worker was harder than I expected. Behavior differed between browsers, and while looking for documentation across the web, we ran into a lot of outdated or incomplete information. We even found ourselves asking a Stack Overflow question about issues that seemed basic enough to be documented somewhere, but we couldn’t find any documentation for them.

I feel like the service-worker ecosystem is still in its bleeding-edge phase, and to be honest, I think that by now, an awesome tool like that should be more developer-friendly. I know that many people are working hard to improve this. Hopefully, we’ll get there soon.