This means that our users are not only getting the top deals and prices out there, but they’re also able to tap into our collective knowledge to find inspiration for their trips. We compute these recommendations daily and personalize them based on the user’s location.

Historically, the discovery feeds on this screen have been powered by our ‘Everywhere’ search, which allowed users to explore the best deals in the market. With the recommended feed, we took a slightly different approach, setting price aside and focusing on where our users are actually going. Since launching, we’ve seen an increase of over 5% in our conversion rate.

This post gives an overview of the technical architecture we built to deliver destination recommendations.

Requirements

Let’s first have a look at the platform requirements to support serving these recommendations. From a high-level perspective, there are three points that are core to the problem. We need to:

Store a lot of historical data that can be used to train and test our models;

Process large amounts of data;

Serve resulting recommendations to users with minimum latency.

But when we start digging a bit deeper, there’s even more that we need to care about:

Rapid experimentation — we rely on user feedback to validate our assumptions, so we need to be able to easily plug and play new algorithms and get results from the opposite end of the pipe;

Minimal operational cost — we’re product focused, so we can’t afford to spend too much of our time running manual jobs or maintaining infrastructure.

Architecture

With these requirements in mind, we arrived at a two-tiered architecture, which you can see in the image below.

Two-tiered architecture behind Skyscanner’s Recommended Destinations

At the very bottom we have the ‘Logs Data Store’ which stores every user event logged by any Skyscanner team through our data platform. Our batch processing layer runs offline and takes advantage of this data store by consuming its data and running it through a series of algorithms. When it’s done, it feeds the fresh recommendations to a live data store. This data store is the base of a low-latency web API that serves the computed data to our users.

One of the main advantages of having a clear-cut separation between batch and live is that it gives a lot of space to experiment with new algorithms and datasets. As long as the results conform to a predefined schema, we know that they can be served by the API.
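To make that contract concrete, here’s a purely illustrative example of the shape a single row of batch output might take; the field names and values below are hypothetical, not our actual schema.

```python
# Illustrative only: a made-up example of the fixed row shape a batch job
# might emit. Any algorithm whose output matches the agreed contract can be
# served by the API without changes to the live layer.
example_row = {
    "origin_market": "UK",  # market the recommendations were computed for
    "destination": "BCN",   # recommended destination (city/airport code)
    "rank": 1,              # position of the destination in the feed
    "score": 0.87,          # algorithm-specific relevance score
}
```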

Because we want to keep operational costs low, this architecture was built to run on the AWS cloud. The Logs Data Store is based on S3 and is maintained by our data management teams.

Deep Dive on Batch Processing

The batch processing layer is where we do all the data transformations that generate the recommendations that we serve through the live service.

We use AWS DataPipeline to orchestrate this work. DataPipeline fits our use case pretty well and helps bring reliability to our batch jobs.

Attached to our DataPipeline we have a schedule of when we want our batch job to run. This is no fancier than a normal cron job, but it’s pretty handy to have it integrated with the whole system and fully managed by AWS.

We also attached a set of preconditions, which help make the pipeline more reliable. We use this feature to make sure that we have all the data we need before starting the process, which is especially important when some of the input is the output of a different batch job. Preconditions can also be configured to wait for a certain period of time, so if some condition is not satisfied at the scheduled hour, the job won’t necessarily fail; it might just execute a little later. And if the preconditions are never met, the job fails before provisioning any hardware, so we don’t incur any unnecessary costs.
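As a rough illustration, the schedule and an S3 precondition might be described with pipeline objects along the lines of the sketch below (expressed here as Python dicts mirroring the JSON pipeline definition; the bucket, key and timings are hypothetical).

```python
# Rough sketch of two AWS Data Pipeline objects, written as Python dicts that
# mirror the JSON pipeline definition. Names, paths and times are made up.
daily_schedule = {
    "id": "DailySchedule",
    "type": "Schedule",
    "period": "1 day",                      # run once a day, like a cron job
    "startDateTime": "2017-01-01T03:00:00",
}

input_data_ready = {
    "id": "InputDataReady",
    "type": "S3KeyExists",
    # Wait for an upstream job's success marker before provisioning anything.
    "s3Key": "s3://example-logs-bucket/events/"
             "#{format(@scheduledStartTime, 'YYYY-MM-dd')}/_SUCCESS",
    "preconditionTimeout": "2 Hours",       # how long to keep retrying
}
```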

When all preconditions pass, DataPipeline will provision an EMR cluster. EMR is an AWS-hosted version of Hadoop YARN, flavored with your choice of distributed data processing system. We chose Spark. There are two main reasons why we opted for Spark: it’s very fast, in great part due to its in-memory caching, and it’s also very developer friendly. We’re particularly fond of Spark SQL and the DataFrame abstraction. Also, PySpark was a great selling point for us.

Most of the code we write runs as a Spark job. Even if the details can get a bit tricky, from a high-level point of view the code works as shown in the image below.

High-level flow of a Spark job: Read → Clean → Rank → Write
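In PySpark terms, a stripped-down version of that flow might look roughly like the sketch below; the paths, column names and ranking heuristic are illustrative rather than our actual logic.

```python
# Minimal PySpark sketch of the read → clean → rank → write flow. Paths,
# column names and the ranking heuristic are illustrative only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("destination-recommendations").getOrCreate()

# Read: raw user events from the Logs Data Store on S3.
events = spark.read.parquet("s3://example-logs-bucket/events/date=2017-01-01/")

# Clean: keep only well-formed search events.
searches = events.filter(
    (F.col("event_type") == "search") & F.col("destination").isNotNull()
)

# Rank: count searches per origin market and destination, keep the top 20.
counts = searches.groupBy("origin_market", "destination").count()
window = Window.partitionBy("origin_market").orderBy(F.desc("count"))
ranked = (
    counts.withColumn("rank", F.row_number().over(window))
    .filter(F.col("rank") <= 20)
)

# Write: publish the fresh recommendations for the live layer to pick up.
ranked.write.mode("overwrite").parquet("s3://example-recs-bucket/recommendations/")
```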

Deep Dive on Live Service

We’ve looked at how we produce recommendations; let’s have a look now at how we serve them. As you may have guessed, this layer is very thin and we do our best to keep it as light and simple as possible.

For our Live Data Store we decided to use Postgres. We chose it because:

It is available in RDS, AWS’s managed relational database service, with out-of-the-box multi-region replication, which means minimal maintenance work for us

It has very low latency (we measure consistently below 5ms per query)

It’s easy to scale for read-heavy workloads

It’s relatively painless to adapt the schema and queries as the product matures
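To make the read path concrete: serving recommendations boils down to small indexed lookups like the hypothetical one sketched below; the table, columns and connection details are made up.

```python
# Hedged sketch of the live read path against Postgres. The table, columns
# and connection details are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="recommendations-db.example.internal",
    dbname="recommendations",
    user="live_api",
)

def top_destinations(origin_market, limit=10):
    """Return the highest-ranked destinations for a given origin market."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT destination, score
            FROM recommended_destinations
            WHERE origin_market = %s
            ORDER BY rank
            LIMIT %s
            """,
            (origin_market, limit),
        )
        return cur.fetchall()
```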

In front of the data store there’s a web service that serves a REST API. At Skyscanner we believe in microservices and recently made a move towards Backends for Frontends. This means our mobile clients don’t query this API directly; they go through a BFF service that aggregates all the data they need to render the final UI. This particular architectural decision makes our job at this layer much easier: we can provide a very minimalistic API that another internal service will consume. This way we maximize reusability and enable ourselves to experiment with many different UIs without changing the contract.
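To give an idea of how minimal that API can be, a hypothetical endpoint might look something like the sketch below; Flask is used purely for illustration, and the route and payload shape are made up.

```python
# A hypothetical, minimal REST endpoint that a BFF could consume. Flask is
# used purely for illustration; the route and payload shape are made up.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/recommendations/<origin_market>")
def recommendations(origin_market):
    # top_destinations() is the hypothetical query helper sketched earlier.
    rows = top_destinations(origin_market)
    return jsonify([
        {"destination": destination, "score": score}
        for destination, score in rows
    ])
```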

Conclusions

We’ve been running this in production for a while. So far, it’s been mostly painless, but not without its fair share of learnings and iterations. The flexibility of writing new algorithms in the batch layer and having them immediately served live has already paid off, and we’re continuously iterating on those. On the other hand, the effort required to maintain the DataPipeline was perhaps a bit underestimated, as we ended up spending quite a bit of time working around the service’s many caveats.

One of the challenges we expect for the future comes with scaling the size of the output from the batch processing layer. The size of the output is proportional to the personalization factor we use. If, for example, we start computing recommendations that are tailored for each Skyscanner user individually, the size of the output will multiply by the size of our user base. Our current strategy of replacing a dataset on a live data store may not hold at this scale and at some point we will need to revisit that architectural decision.

There’s also the question of real-time models. Currently, everything is done in batch, but there may be products that require fresher results. At Skyscanner, we already use Kafka for streaming data and Samza for some real-time processing. So that side of the infrastructure won’t be the hardest part. But we may need to revisit our choice of Postgres for the live data store and perhaps switch to something that’s easier to scale on write-heavy workloads.

Work with us

We do things differently at Skyscanner and we’re on the lookout for more Engineering Tribe Members across our global offices. Take a look at our Skyscanner Jobs for more vacancies.

About the Author

Hello, I’m André and I work as a Software Engineer at Skyscanner. I’m part of the Loyal and Frequent Travelers Tribe, where we focus on improving the experience for the kind of travellers who know the cabin crew by their first names. I’m passionate about delivering products that add value to people’s lives and growing them to world scale.