Web Performance vs. User Engagement

How Vrbo™ correlates business events to performance data

Porsches are beautiful, engaging and fast — three qualities a good website should have. Photo by Francesco Lo Giudice on Unsplash

In this story, we will look at how Vrbo™ (an Expedia Group™ company) implemented an automated process to correlate business events to performance data. I hope it will inspire you to do the same.

But first we need to cover some basics. We all know that faster sites convert more customers. But it’s one thing to look at data from others and trust the same will apply to your website. It’s another thing to do your own research. When you analyze your own data, it suddenly becomes more meaningful.

The holy grail of performance monitoring is finding how performance correlates to conversion.

Disclaimer: If you are reading this, it probably means you have some interest in web performance, and most likely you don’t need convincing about the benefits of having a fast website. So I’ll skip the part where I try to convince you, but just in case you need some convincing, here is an excellent article from Google, or hundreds of case studies from WPO stats.

How do we measure site performance?

Skip this chapter if you are familiar with RUM and user-centric performance metrics.

RUM

Real user monitoring (RUM) is a passive monitoring technology that records all user interaction with a website or client interacting with a server or cloud-based application (source: Wikipedia). We use it as a gauge to track whether we are actually making improvements in the real world.

RUM events at Vrbo number in the hundreds of millions per day. Deriving performance insights in a reasonable amount of time from such a voluminous raw dataset is difficult, as you can imagine. Therefore, we preprocess RUM data into aggregate summary statistics to make querying the data in real time possible. This aggregated data (and its subsequent derivations) is what populates our dashboards.

Spark job runs daily and aggregates data into summary statistics. Dashboards consume aggregated data to populate graphs
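To make the aggregation step concrete, here is a minimal sketch of the kind of reduction the daily job performs. This is plain Python over an in-memory sample rather than Spark, and the event field names (`page`, `metric`, `value`) are illustrative assumptions, not Vrbo's actual schema:

```python
from collections import defaultdict

# Hypothetical raw RUM events: one dict per beacon (field names are illustrative).
raw_events = [
    {"page": "search", "metric": "FCP", "value": 1200},
    {"page": "search", "metric": "FCP", "value": 1850},
    {"page": "search", "metric": "FCP", "value": 950},
    {"page": "detail", "metric": "FCP", "value": 2100},
    {"page": "detail", "metric": "FCP", "value": 1700},
]

def aggregate(events):
    """Group events by (page, metric) and reduce each group to summary statistics."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["page"], e["metric"])].append(e["value"])
    summary = {}
    for key, values in groups.items():
        values.sort()
        summary[key] = {
            "count": len(values),
            "tp50": values[len(values) // 2],  # crude upper-median on sorted values
            "max": values[-1],
        }
    return summary

print(aggregate(raw_events))
```

The real pipeline runs this shape of group-and-reduce at scale in Spark; dashboards then query only the small summary table instead of hundreds of millions of raw events.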

We also use synthetic monitoring at Vrbo, but this story is focused on RUM data.

Performance metrics

I have participated in numerous meetings at Expedia Group™ where the following question was asked: which performance metrics should we track? The main issue is not the availability of performance metrics; rather, choosing ones that are meaningful to your website can be confusing and overwhelming.

“You make what you measure, so measure carefully”

I’m unsure of the original source of this quote, but I saw it for the first time in the book Chaos Monkeys by Antonio García Martínez. I find this quote perfectly encapsulates performance monitoring. It reminds us to carefully choose the performance metrics we track, because those are the metrics that will most likely improve.

Below are the performance metrics deemed most meaningful to Vrbo. The details of how we went about selecting these metrics are somewhat complicated, but the choice was mainly driven by the perceived relationship between a metric and real user experience:

PAR (Primary Action Rendered): PAR is a custom metric that measures how long it takes for the most important feature of the page to be rendered to our end users. Since this is a Vrbo-specific metric, we will avoid using it in the examples of this blog post, but it’s one of our most important metrics.

FCP (First Contentful Paint): Definition from Google: First Contentful Paint (FCP) measures the time from navigation to the time when the browser renders the first bit of content from the DOM. This is an important milestone for users because it provides feedback that the page is actually loading. How quickly can the user see something?

FID (First Input Delay): Definition from Google: First Input Delay (FID) measures the time from when a user first interacts with your site (i.e., when they click a link, tap on a button, or use a custom, JavaScript-powered control) to the time when the browser is actually able to respond to that interaction. This metric helps us understand if we are using too much CPU while rendering the page.

TTFB (Time to First Byte): TTFB (aka `Backend Time`) is not necessarily a user-centric metric, but we still follow it closely because it allows us to isolate backend vs. frontend changes.

FPS (Frames Per Second): Frame rate is most familiar from film and gaming, but is now widely used as a performance measure for websites and web apps. It’s commonly used as a synthetic metric (e.g. Mozilla and Google) but it can also be measured from RUM. We implemented our own optimized utility to measure FPS in a browser environment and we are currently in the process of open-sourcing this code.

These are not the only metrics we measure. They’re just the ones we follow very closely.

Performance regions

“My site loads in 5 seconds”

How many times have you heard something similar? First of all, we need to define what “load” means: is that when the first pixels are painted (FCP), or when the load event is fired? Second, what kind of measurement was used? Is that a 5-second average (mean), tp50 (median), tp90, or something else?
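The ambiguity is easy to demonstrate. The Python sketch below uses made-up load times for ten visits and shows that the same measurements support very different “my site loads in N seconds” claims depending on which statistic you report:

```python
import statistics

# Hypothetical page load times in seconds for 10 visits (note the slow tail).
load_times = [1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0, 4.0, 8.0, 20.0]

mean = statistics.mean(load_times)                          # pulled up by the slow tail
tp50 = statistics.median(load_times)                        # half of visits were faster
tp90 = sorted(load_times)[int(len(load_times) * 0.9) - 1]   # simple nearest-rank p90

print(f"mean={mean:.2f}s  tp50={tp50:.2f}s  tp90={tp90:.2f}s")
# The mean says ~4.6s, the median says ~2.4s, and the p90 says 8s — same data.
```

A single number without its statistic (and without the underlying distribution) tells you very little.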

Also, looking at single statistics is highly problematic. Here are a few examples from Rico Mariani’s post about Understanding Performance Regions (which served as a major inspiration for the work behind this blog post):

“Mean: You can commit any crime of variability and keep the mean constant.”

“P90: If you report only P90 you can commit any crime you like before or after the P90 as long as that one point stays fixed you’re fine. For instance if the best 50% all got somewhat worse that wouldn’t affect the P90 at all. Or if you moved the P90 down at the expense of say, P25 and P50, that isn’t good.”

“P50: Ibid mostly… improvements in the top half do not register, nor does worsening of the back half.”
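These “crimes” are easy to reproduce with synthetic data. The hedged sketch below (made-up numbers, nearest-rank percentiles) builds one pair of samples with identical means but wildly different user experiences, and a second pair where P90 is unchanged while the median quadruples:

```python
import statistics

def pctl(values, q):
    """Nearest-rank percentile on a sorted copy (0 < q <= 1)."""
    s = sorted(values)
    return s[max(0, int(len(s) * q) - 1)]

# Crime of variability: identical means, very different experiences.
steady  = [2.0] * 10                 # every visit takes 2s
erratic = [0.5] * 9 + [15.5]         # mostly fast, one 15.5s disaster
print(statistics.mean(steady), statistics.mean(erratic))  # both 2.0

# Fixed P90, rotting median: the back half worsens unnoticed.
before = [1, 1, 1, 1, 1, 1, 1, 1, 9, 10]
after  = [4, 4, 4, 4, 4, 4, 4, 4, 9, 10]
print(pctl(before, 0.9), pctl(after, 0.9))  # P90 identical
print(pctl(before, 0.5), pctl(after, 0.5))  # median went from 1 to 4
```

Any single summary statistic in isolation can hide a regression like these, which is why looking at the full distribution matters.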

Due to the problems highlighted above, at Vrbo we try to look at the distribution as much as possible: