Automate precaching resources

Posted: Apr 8, 2018

Nobody likes to wait. When a user clicks a link, they expect an immediate response. If it takes a while, the user might switch to another tab and completely forget about the site that is still loading. For site owners, that means lost customers. Progressive Web Apps aim to fix this by delivering a response close to what native apps provide.

Native apps keep all their resources on disk, so they don't need to download anything. Using new technologies shipped in browsers, we can serve resources from the browser's cache before the user requests them. This way, users get an experience similar to native apps: the instant response keeps the user's focus.

There are a few ways to precache resources. In the first place, we might want to precache everything, but there are a few reasons not to do that:

- Bandwidth usage. Users on mobile devices might have a limited Internet quota, so we need to be careful not to exhaust it. Also, please don't forget about their battery.
- Load on backend. Lots of requests from every user might kill your backend.
- Disk space. Browsers manage cached data; even if you cache the entire site, don't expect to see all resources in the cache, as browsers might remove some to free disk space.
- Stale content. Precaching resources is a good thing, but we want users to see up-to-date content.

Having considered the arguments above, I started working on Sirko Engine, which aims to be more accurate in precaching resources. Actually, Sirko Engine is only a part of the solution; there is also Sirko Client. Presently, the project has two big features:

- precaching pages and serving them from the cache on request
- accumulating cached pages to serve them in offline mode

Below, I describe how the project works technically. If you want to know how to install it, please refer to the installation guide.

To know which resources should be precached, the engine gathers information about how users navigate the site.

When the user visits the site, the client makes a request to the engine. That request includes the referrer, the current path, and a list of URLs to assets (JS and CSS) on the current page.

```json
{
  "referrer": "/home",
  "current": "/project",
  "assets": [
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css",
    "https://demo.sirko.io/assets/css/style.css",
    "https://engine.sirko.io/assets/client.js",
    "https://demo.sirko.io/assets/app.js",
    "https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js",
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js",
    "https://demo.sirko.io/assets/project.js"
  ]
}
```
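As a rough sketch (the function name and filtering rule are my assumptions, not the actual Sirko Client code), such a payload could be assembled like this:

```javascript
// Hypothetical sketch of assembling the tracking payload;
// the real Sirko Client may do this differently.
function buildVisitPayload(referrerPath, currentPath, assetUrls) {
  // Only JS and CSS assets are reported, as in the example payload.
  const assets = assetUrls.filter((url) => /\.(js|css)(\?|$)/.test(url));
  return { referrer: referrerPath, current: currentPath, assets };
}
```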

A relation between the referrer and the current page is a transition, and a transition has a direction. Internally, every user is represented by a separate session which keeps the directed transitions made by that particular user.

However, most sessions don't look that simple; there might be transitions back.

It resembles a graph; actually, it is a graph.

The engine isn't interested in any particular user. So, if there are no more transitions from the user within 1 hour, the session expires, even though the user might come back later. Expiration is an important step: expired sessions contribute to the overall transitions from one page to another. The overall graph looks like this:

You might've noticed a node without a path; it is an exit node. All sessions connected to this node are expired.

Let's review each element on this graph.

The session relation keeps the number of transitions made by a particular user:

```json
{
  "occurred_at": 1521176451536,
  "count": 2,
  "key": "a9f81dfdefc9dad197bead6d812f0468dee1c5fc7976dc1ded8b0ae1d0535bd5"
}
```

During one session, the user might visit the same pages several times; to account for that, the session relation keeps the count property. The occurred_at property is required to identify the age of the session, which is needed to expire inactive sessions and remove stale sessions (this step is described below).
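The expiry check itself is simple; here is a sketch of it (the function name is hypothetical, and the timestamps are in milliseconds, as in the occurred_at example above):

```javascript
// A session expires after 1 hour without new transitions.
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: compares the session's last activity
// against the current time.
function isSessionExpired(session, nowMs) {
  return nowMs - session.occurred_at > HOUR_MS;
}
```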

The transition relation keeps the total number of transitions made by all users:

```json
{
  "updated_at": 1521057839027,
  "count": 14
}
```

For example, if we have 3 session relations between the /about and /contact pages, and each of those sessions keeps 1 as the value of its count property, then the transition relation between them keeps 3 in its count property. The updated_at property keeps the time when the relation was last updated.
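In other words, a transition's count is the sum of the counts of its session relations (a sketch with illustrative names, not the engine's API):

```javascript
// The transition count aggregates the counts of all session
// relations between the same pair of pages.
function transitionCount(sessionRelations) {
  return sessionRelations.reduce((total, s) => total + s.count, 0);
}
```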

The page node keeps details about a page:

```json
{
  "path": "/home",
  "assets": [
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css",
    "https://demo.sirko.io/assets/css/style.css",
    "https://engine.sirko.io/assets/client.js",
    "https://demo.sirko.io/assets/app.js",
    "https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js",
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"
  ]
}
```

To predict pages, the engine uses a Markov chain. The logic was described in another article of mine, although there are a few changes.

The start point was removed, as it didn't add any value to the model.

Instead of predicting one page, the engine predicts several pages. The initial idea described in that article was designed around the prerender hint, which could prerender only one page. Since the prerender hint isn't used by the project anymore, there is no one-page limitation.

The engine also predicts assets for pages.

After pages are predicted, they are checked against a confidence threshold; only pages which pass it are precached. The confidence threshold is defined as a setting of the engine and helps control the load on the backend.
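To illustrate this step, here is a simplified sketch in JavaScript (the real engine is written in Elixir, and the function name and data shape are my assumptions): the confidence of each candidate page is its share of all outgoing transitions from the current page, and only candidates above the threshold survive.

```javascript
// Simplified Markov-chain-style prediction over transition counts.
// `transitions` maps a target path to the number of recorded
// transitions from the current page to that path.
function predictPages(transitions, threshold) {
  const total = Object.values(transitions).reduce((sum, n) => sum + n, 0);
  if (total === 0) return [];
  return Object.entries(transitions)
    .map(([path, count]) => ({ path, confidence: count / total }))
    .filter((p) => p.confidence >= threshold)
    .sort((a, b) => b.confidence - a.confidence);
}
```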

Once a prediction is received from the engine, the client precaches the resources via Cache Storage. Thus, when the user moves to another page, a service worker checks whether a requested resource is in the cache; if so, it is served from the cache, otherwise it is loaded normally. After the page loads, the cached resources get removed from the cache to avoid serving stale content.
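The cache-or-network decision described above can be sketched as follows. This is a hypothetical helper, not the actual Sirko Client code: `cache` stands for anything exposing Cache-Storage-like `match()`/`delete()` methods, which in a real service worker would be used inside a fetch event handler.

```javascript
// Serve a request from the cache if possible, dropping the cached
// entry afterwards so stale content isn't served on a later visit;
// otherwise fall back to the network.
async function respond(request, cache, networkFetch) {
  const cached = await cache.match(request);
  if (cached) {
    await cache.delete(request); // avoid serving stale content later
    return cached;
  }
  return networkFetch(request);
}
```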

Cache invalidation is a challenge. While the page is open, the user might submit data which changes the precached pages. Therefore, the service worker not only serves cached resources but also keeps an eye on fired requests. For example, if there is a request modifying data, the transition won't be tracked:

```
GET /contact
GET https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css
GET https://demo.sirko.io/assets/app.js
POST /message
```

The service worker verifies the requests between the referrer and the current page. In this example, the /contact page is the referrer and /message is the current page. Obviously, the user entered something on the contact page; the service worker spots it and tells the engine not to track this transition, because the message page cannot be precached in this case. It even works for AJAX requests:

```
GET /messages
POST /messages (an AJAX request made via JS on the messages page)
GET /home
```

The transition between the messages and home pages won't be tracked.
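The rule boils down to this: a transition isn't tracked when any data-modifying request was fired between the two page loads. A sketch (the function name and the shape of the request log are assumptions):

```javascript
// Decide whether to track a transition, given the log of requests
// fired between the referrer and the current page.
// Any non-GET request means data might have changed, so the
// transition must not be tracked.
function shouldTrackTransition(requests) {
  return !requests.some((r) => r.method !== 'GET');
}
```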

Besides expired sessions, there are stale sessions. All sessions which have been kept in the DB for more than 7 days are stale. Stale sessions get removed, and their counts get subtracted from the total number of transitions between pages. Sites change, so some pages might be removed. The idea behind removing stale sessions is to slowly fade transitions between pages which aren't used anymore; eventually, they disappear.

For example, these pages don't have session relations anymore (probably, the pages were removed from the site); there is only the transition relation, which will be removed by the engine. The page nodes cannot stay lonely, so they get removed as well. This operation makes sure there is no garbage in the DB which might mislead the prediction model and inflate the DB.
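The fading can be sketched like this (illustrative names and data shapes, not the engine's Cypher queries): stale session counts are subtracted from the transition total, and a transition whose count drops to zero disappears.

```javascript
// Sessions older than 7 days are stale and get removed; their counts
// are subtracted from the transition's total.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Returns the faded transition with its remaining sessions,
// or null when the transition's count drops to zero.
function fadeTransition(transition, sessions, nowMs) {
  const fresh = sessions.filter((s) => nowMs - s.occurred_at <= WEEK_MS);
  const staleCount = sessions
    .filter((s) => nowMs - s.occurred_at > WEEK_MS)
    .reduce((sum, s) => sum + s.count, 0);
  const count = transition.count - staleCount;
  return count > 0 ? { transition: { count }, sessions: fresh } : null;
}
```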

Above, I mentioned that precached resources get removed. Actually, that isn't quite true: the client moves them to a separate cache which is used when the user is offline. There are two caches:

- sirko-prefetched keeps predicted resources.
- sirko-offline keeps all precached resources. Basically, when the user navigates to the next page, all resources from sirko-prefetched get shifted to sirko-offline; thus, this cache accumulates resources.
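Here is a sketch of that shift (Maps stand in for the real Cache Storage caches, which expose an async API; the function name is hypothetical):

```javascript
// On navigation, move every entry from the sirko-prefetched cache
// to the sirko-offline cache, so the offline cache accumulates
// everything that has ever been precached.
function shiftToOffline(prefetched, offline) {
  for (const [url, response] of prefetched) {
    offline.set(url, response);
  }
  prefetched.clear();
}
```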

Only predicted resources are served offline. However, there is a trick to cache the entire site for offline work, but it has costs.

The engine is written in Elixir. I've been working with Ruby for the last 12 years, so for this project, I wanted a language which keeps me as productive as Ruby but is also very fast and scalable, and doesn't make me fight with the language (this statement isn't about Ruby). It is easy to start working with Elixir; you just need to understand a few crucial things. A project is like an OS where applications are libraries which work concurrently; every application has processes, which are kind of like objects in OOP languages, and they also work concurrently. Processes have behavior and they might have state (very similar to objects in OOP, isn't it?).

The client is written in JavaScript. I chose Rollup to bundle it. Initially, I used Webpack, but after trying out Rollup, I discovered that Rollup compresses my JS code better. Anyway, I don't need most of Webpack's plugins; my library is only about JS code.

Neo4j was chosen as the DB. When your data structure is a graph, it makes sense to use a graph DB. Nodes and relations can have properties, which is a very useful feature for my project. Also, the Cypher query language is really powerful. When I can compute something in the DB without fetching the data, I prefer to do that; thus, memory consumption stays low. In Ruby projects, I work with ActiveRecord; it is a great library, but it hides the advanced features of DBs. For this project, I decided I wanted access to everything the DB gives me.

I am trying to create a task here and there for each of my ideas, but only finalized tasks are there; there are still lots of thoughts which I write down outside of GitHub. Currently, I have these in mind (the order is arbitrary, no priority):

- Images. The client doesn't gather URLs to images, so they aren't precached. It might be critical for offline work.
- Subdomains. There is no way to use the project on sites with subdomains.
- Other resource hints. You might've heard of the dns-prefetch or preload hints, which might be supported by the project too.
- Better prediction model. The current prediction model is very simple. For example, it doesn't consider screen sizes, which might drastically affect how users navigate. Responsive design means adjusted navigation; because of that, users on mobile devices are ignored, and the engine doesn't make any predictions for them.

Before working on new features, I would like to gather feedback about the idea of this project. If there isn't enough interest, there is no point in adjusting anything. So, please leave feedback in the comments.