Machine Learning @ Teads — Part 2

Stack, production workflow and practice

In the previous post we talked about why we use Machine Learning at Teads and which particular use cases we work on. In this article we cover which technologies we use, why we had to build new solutions, and our ML production workflow. We end with how we improve our ML practice.

ML Stack and why we do not use MLlib

Our stack leverages existing technologies, with Apache Spark sitting at the center. Our training jobs are scheduled by Jenkins, their dependencies are handled by Coursier, and they run on Amazon EMR (Elastic MapReduce), with training logs and resulting models stored on S3. We also use Jupyter notebooks for analysis.

Our use cases require us to work with large streams of ad delivery events, such as bid requests, impressions or complete views, and to continuously update our models. In this context, MLlib initially seemed to be the right solution.

MLlib is a native component of Spark dedicated to Machine Learning applications. It provides tools to build ML pipelines as well as common learning algorithms. That being said, MLlib had major limitations for us:

The logistic regression and clustering implementations in Spark use DenseVectors, which are incompatible with high-dimensional sparse data like ours.

In order to avoid any discrepancy, we needed to use the same library for offline training and online predictions.

Building our own prediction library

To overcome MLlib’s limitations we decided to build our own library. The runtime is still based on Spark. Our library acts as an abstraction layer between Spark and the underlying implementations from Breeze, and it also embeds our custom algorithms.

This library is part of a more general prediction framework that lets us test new experimental approaches and guarantees that the same code is used both online and offline.

Here is our Machine Learning workflow. We will get into the offline/online details later on:

Machine Learning Workflow — Credit JU Han

The first step [1] is the actual service making predictions. The generated application logs are then used, together with other sources of data (DMPs, etc.) to build training sets [2].

For the training jobs [3], these data sets are randomly split into several partitions that are balanced to avoid hotspots. We then process each partition to clean and transform the features before vectorizing them using the Hashing Trick. For a model with 20 features, the resulting sparse vector can activate up to one million different indices.
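
To make the vectorization step more concrete, here is a minimal Python sketch of the hashing trick. It is only an illustration, not our library's code: the function names, the bucket count of 2^20 (roughly one million indices), the choice of MD5 as a stable hash and the example event are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # about one million indices (assumed size)


def hashed_index(name: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (feature name, value) pair to a stable index in [0, num_buckets)."""
    key = f"{name}={value}".encode("utf-8")
    # MD5 is used only as a stable, well-mixed hash (not for security).
    # Python's built-in hash() is salted per process, so it would not give
    # the same index offline and online.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def vectorize(features: dict, num_buckets: int = NUM_BUCKETS) -> dict:
    """Return a sparse vector as {index: value}; colliding features add up."""
    vector = {}
    for name, value in features.items():
        idx = hashed_index(name, str(value), num_buckets)
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


# Made-up event with three categorical features.
event = {"country": "FR", "device": "mobile", "hour": 21}
sparse = vectorize(event)
```

Only the few indices hit by the event are stored, so the cost of a vector depends on the number of features rather than on the size of the index space.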

Results are then stored in local matrices used during iterations of the optimization algorithms: L-BFGS for logistic regressions and Expectation-Maximization for clustering algorithms.
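
As an illustration of what an optimizer such as L-BFGS needs at each iteration for a logistic regression, the sketch below computes the mean log loss and its gradient over sparse examples. It is a plain Python illustration under our own assumptions (sparse vectors as `{index: value}` dictionaries, hypothetical function names); it does not reproduce the Breeze-based implementation and does not include the L-BFGS update itself.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss_and_gradient(weights: dict, examples: list):
    """Mean log loss and its sparse gradient.

    weights:  {index: weight}, missing indices count as 0.0
    examples: list of (sparse_vector, label), label in {0, 1}
    """
    total_loss = 0.0
    gradient = {}
    for vector, label in examples:
        # Dot product over the active indices only.
        z = sum(weights.get(i, 0.0) * v for i, v in vector.items())
        p = sigmoid(z)
        p_safe = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total_loss -= label * math.log(p_safe) + (1 - label) * math.log(1.0 - p_safe)
        for i, v in vector.items():
            gradient[i] = gradient.get(i, 0.0) + (p - label) * v
    n = len(examples)
    return total_loss / n, {i: g / n for i, g in gradient.items()}


# Two made-up examples sharing index 3.
examples = [({3: 1.0, 17: 1.0}, 1), ({3: 1.0, 42: 1.0}, 0)]
loss, grad = log_loss_and_gradient({}, examples)
```

Both the loss and the gradient only touch indices that are active in at least one example, which is what makes sparse representations practical at this dimensionality.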

Focus on step [3] Training job — Credit JU Han

Training jobs are scheduled using Jenkins and generate a prediction model, essentially a set of weights, that is stored on S3. All these steps are done offline. Once ready, models are used by different online services (back to step [1]).

Online model testing using A/B tests

Evaluating complex web systems and their impact on user behavior is a challenge of growing importance. A/B tests aim to determine which algorithm, home page, user interface, etc., provides the best results in terms of relevant criteria such as traffic and revenue.

It is tempting to assume that one system will perform better than another. In fact, we are wrong most of the time when it comes to guessing what users want or what they respond to most. The only way to evaluate a new system is to test it in a statistically valid framework.

The use of A/B testing is now widespread in the industry. It compares two versions of a system, A and B, by splitting users randomly into two independent populations to which systems A and B are respectively applied. We use this method to evaluate the impact of our prediction models.

A/B Testing — icons from icons8

We created a small web app to set up and manage the A/B tests and keep track of each experiment. A simple web UI makes it easy to define parameters for different populations.

A/B Testing setup UI

We usually split users into three populations, one of which is kept small so that we can quickly identify a major issue with the experiment.
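
One common way to implement such a split is to hash a user identifier into a bucket, so that a given user always lands in the same population. The Python sketch below illustrates this idea only; the population names, the traffic shares, the use of SHA-256 and the experiment key are assumptions for the example, not a description of our web app.

```python
import hashlib

# Hypothetical populations and traffic shares, including one small population.
POPULATIONS = [("A", 0.45), ("B", 0.45), ("small", 0.10)]


def assign_population(user_id: str, experiment: str, populations=POPULATIONS) -> str:
    """Deterministically map a user to a population in proportion to its share."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    # Turn a stable hash into a roughly uniform number between 0 and 1.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2 ** 64
    cumulative = 0.0
    for name, share in populations:
        cumulative += share
        if bucket < cumulative:
            return name
    return populations[-1][0]  # guard against floating-point rounding


# Made-up user ids for a hypothetical experiment.
assignments = [assign_population(f"user-{i}", "model-v2") for i in range(10000)]
small_share = assignments.count("small") / len(assignments)
```

Including the experiment name in the hashed key means that the same user can fall into different populations in different experiments, which keeps experiments independent of one another.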

A/B test analysis

We automatically perform a general analysis of the results and use notebooks for custom reviews. We pay special attention to statistical significance, which we assess with confidence intervals computed using the bootstrap method.
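
As an illustration of the bootstrap method, the sketch below computes a percentile bootstrap confidence interval for the difference in mean outcome between two populations. It is a simplified example: the function name, the number of resamples, the 80% level and the synthetic completion data are assumptions, and the percentile variant is only one of several bootstrap interval constructions.

```python
import random


def bootstrap_ci(sample_a, sample_b, level=0.80, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for mean(sample_b) - mean(sample_a)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each population with replacement, at its original size.
        resample_a = rng.choices(sample_a, k=len(sample_a))
        resample_b = rng.choices(sample_b, k=len(sample_b))
        diffs.append(sum(resample_b) / len(resample_b)
                     - sum(resample_a) / len(resample_a))
    diffs.sort()
    alpha = (1.0 - level) / 2.0
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Synthetic outcomes (1 = video completed, 0 = not completed); not real data.
population_a = [1] * 300 + [0] * 700  # 30% completion
population_b = [1] * 330 + [0] * 670  # 33% completion
low, high = bootstrap_ci(population_a, population_b)
```

If the resulting interval excludes zero, the observed difference is unlikely to be due to sampling noise alone at the chosen confidence level.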

An important aspect of the analysis is that we track business metrics such as margin and revenue.

Example of completion rate increase for each day of an A/B test, with 80% confidence intervals in dotted lines

A/B tests are great because they are precise and realistic. However, performing an A/B test takes time (up to several weeks) and is hence costly.

Moreover, because A/B tests have an impact on production, we cannot rely on them to select the best model among tens or hundreds of combinations.

Offline model testing

Since we cannot test all of our models online, we also perform offline testing. Offline testing involves several steps, the first one being model training.

The second step is the model validation. In this step, we want to measure the predictive power of our models by computing validation metrics on a validation set.

This validation is performed iteratively to be as close as possible to the production cycle, as illustrated in the following chart (3 iterations).

In particular, the validation metrics we use include the weighted Mean Squared Error (MSE), with specific weights for each use case.

For example, when we want to predict billable events, we weight the results according to the amount to be paid when the event occurs. Thus, we focus on the MSE of the expected revenue (Cost per View or CPV) and not the MSE of a billable probability.
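
One possible reading of this metric, sketched below in Python, compares the realized revenue of each event (CPV times the observed label) with its expected revenue (CPV times the predicted probability). This amounts to weighting each squared probability error by the squared CPV. The function name, the exact normalization and the sample values are assumptions made for the illustration.

```python
def revenue_weighted_mse(labels, predictions, cpvs):
    """MSE between realized revenue (cpv * label) and expected revenue (cpv * prediction)."""
    if not (len(labels) == len(predictions) == len(cpvs)):
        raise ValueError("labels, predictions and cpvs must have the same length")
    squared_errors = [
        (cpv * label - cpv * pred) ** 2
        for label, pred, cpv in zip(labels, predictions, cpvs)
    ]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: 1 means the billable event occurred.
labels = [1, 0, 1]
predictions = [0.8, 0.1, 0.5]
cpvs = [0.02, 0.02, 0.10]
score = revenue_weighted_mse(labels, predictions, cpvs)
```

In this example the third event dominates the score because its CPV is five times higher, so an error on it costs more than the same error on a cheaper event.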

We created a tool, called Datakinator, to facilitate the creation of homogeneous experiments that all respect the same testing protocol. This tool also archives them to keep track of the results.

Using Datakinator, we are able to efficiently create tens of experiments without having to worry about weighting the MSE or defining the protocol. It also simplifies later comparisons between models.