Case Study

As a short case study, we present an ML pipeline design for an advertising use case: click prediction in an RTB (Real-Time Bidding) setting. The goal of this ML pipeline is to gather data on inventory, users, and advertisers, and to train ML models that predict, at auction time, the likelihood of someone clicking on an ad. This is one of the core optimization problems in the Ad Tech business.

Life of an Ad Call from 50000 feet

RTB, in a nutshell, provides a way to determine the value of an ad call in real time. The diagram above gives a 50,000-foot view of the life of an ad call. When an ad call happens, the publisher sends AppNexus a request for a bid for the tag/ad-slot. The impression bus (i.e. Impbus) carries these ad calls to multiple bidders, which, in our case, use a machine learning model to determine the value of the ad call. Each bid is sent back to the Impbus, which runs an auction to decide which creative/ad to serve alongside the publisher content.

This pipeline faces many challenges and requirements:

Scalability to handle 400 billion impressions seen per day and 10 billion transactions per day at peak;

Cost Effectiveness to lead to a successful ML product;

Fast Iteration to enable research teams to quickly explore new model types to improve predictive performance;

Dynamic Data Sources that are coming from world-wide streams of impressions/bid requests;

Tight Model Serving Constraints due to bid-request volume and timeout requirements (<100ms response time required per prediction).

ML System architecture

To perform this large-scale machine learning task, many components work together to handle these ML product requirements reliably and efficiently.

ML System Architecture

A typical ML pipeline involves assembling train/test data sets from multiple data sources, performing feature engineering/transformations, training the ML models, evaluating and tuning the models as well as deploying, serving, and monitoring the models’ performance. The above system architecture shows the 4 main steps in the production ML pipeline.

We briefly highlight some of these components in light of the lessons learned described above.

ML Feature Data Warehouse vs. Just-in-time Feature Transforms

The above diagram shows the spectrum of options for managing ML feature data. On the far right-hand side, we see the just-in-time feature transform method. This method gives researchers the highest level of flexibility in feature engineering. However, it can produce a large number of custom scripts to pre-process data sets and, in some cases, lead to pipeline jungles.

The method that we opted for was a unified ML feature data warehouse. By keeping the right level of granularity at the observation level in the ML feature data warehouse, research teams are able to use custom look-backs for both back testing and live testing without having to pre-process data sets in redundant ways. At the same time, production jobs can amortize feature transformations using jobs with customizable frequency. The net effect is fewer ML pipeline jungles, and new experiments can be performed with minimal data pre-processing effort.
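To make the look-back idea concrete, here is a minimal sketch in Python. It is not the AppNexus implementation: the row layout, the `ctr_by_slot` helper, and the sample data are all hypothetical. It only illustrates how observation-level rows let two experiments pick different look-back windows without a separate pre-processing script for each.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical observation-level rows: (timestamp, ad_slot_id, clicked).
OBSERVATIONS = [
    (datetime(2018, 5, 1, 12), "slot_a", 0),
    (datetime(2018, 5, 2, 12), "slot_a", 1),
    (datetime(2018, 5, 3, 12), "slot_a", 0),
    (datetime(2018, 5, 3, 13), "slot_b", 1),
]

def ctr_by_slot(observations, as_of, lookback_days):
    """Click-through rate per slot over the window [as_of - lookback, as_of)."""
    start = as_of - timedelta(days=lookback_days)
    clicks = defaultdict(int)
    views = defaultdict(int)
    for ts, slot, clicked in observations:
        if start <= ts < as_of:
            views[slot] += 1
            clicks[slot] += clicked
    return {slot: clicks[slot] / views[slot] for slot in views}

as_of = datetime(2018, 5, 4)
# Same stored rows, two different look-backs chosen at query time.
short_window = ctr_by_slot(OBSERVATIONS, as_of, lookback_days=1)
long_window = ctr_by_slot(OBSERVATIONS, as_of, lookback_days=7)
```

Because the aggregation happens at read time, changing the look-back is a parameter change rather than a new data set.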

Model Training and Hyper-Parameter Tuning

For model training, logistic regression produces sparse models, which helps reduce prediction time and improve interpretability. In addition, we provided both Python and Java/Scala libraries for the model training component to enforce research-prod parity.
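As a rough illustration of why sparsity helps at prediction time, the sketch below scores a request against a logistic regression model whose non-zero weights are stored in a dictionary. The feature names, weights, and bias are made up for the example and are not taken from the production model.

```python
import math

# Hypothetical sparse weight vector: only non-zero coefficients are stored.
WEIGHTS = {"slot=slot_a": 0.8, "hour=12": -0.3, "advertiser=42": 0.5}
BIAS = -4.0

def predict_click_probability(active_features, weights=WEIGHTS, bias=BIAS):
    """Score one request; features absent from `weights` contribute zero."""
    logit = bias + sum(weights.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-logit))

# "country=US" has no stored weight, so it costs one failed lookup and nothing more.
p = predict_click_probability(["slot=slot_a", "hour=12", "country=US"])
```

The work per prediction scales with the number of active features in the request, and each weight can be read directly as the contribution of one named feature.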

Another component of model training is hyper-parameter tuning. This tuning is expensive if left without monitoring and prior guidance. To reduce our initial search space, we pre-constrain it by running offline experiments to determine adequate ranges for the hyper-parameters. This also helps model freshness, because reducing the training time allows for more frequent model updates if needed.
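One way to picture a pre-constrained search is a random search that only samples inside ranges fixed ahead of time. The sketch below assumes random search with log-uniform sampling, which the post does not specify. The hyper-parameter names, the ranges, and the toy objective are placeholders for whatever the offline experiments would actually produce.

```python
import math
import random

# Hypothetical ranges, standing in for those narrowed by offline experiments.
SEARCH_SPACE = {
    "l1_penalty": (1e-6, 1e-3),
    "learning_rate": (0.01, 0.2),
}

def sample_config(space, rng):
    """Draw each hyper-parameter log-uniformly from its (low, high) range."""
    return {
        name: math.exp(rng.uniform(math.log(low), math.log(high)))
        for name, (low, high) in space.items()
    }

def random_search(objective, space, n_trials, seed=0):
    """Return the (loss, config) pair with the lowest loss over n_trials draws."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = sample_config(space, rng)
        loss = objective(config)
        if best is None or loss < best[0]:
            best = (loss, config)
    return best

def toy_validation_loss(config):
    # Stand-in for training a model and measuring its validation log-loss.
    return ((math.log10(config["l1_penalty"]) + 4.5) ** 2
            + (config["learning_rate"] - 0.05) ** 2)

best_loss, best_config = random_search(toy_validation_loss, SEARCH_SPACE, n_trials=50)
```

Narrower ranges mean fewer trials are needed to land near a good configuration, which is what shortens each training cycle.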

Model Serving

ML research does not traditionally take into account the constraints that come with evaluating models in real time in the model serving infrastructure. In our RTB use case, the model serving layer introduces multiple constraints. Bid responses must fit within very tight response windows (<100ms). Therefore, the model serving engine must minimize both the time it spends converting bid-request data to supported features and the time it spends calculating the score from the currently attached predictive models.

The above two constraints impact the design of the model candidates, both in terms of the model types allowed to be deployed and in terms of the feature types/values selected. For example, requests for candidate feature transformations need to be based on expected model predictive performance as well as runtime performance at prediction time. Features of different data sizes lead to different outcomes:

Adding a single integer id as a feature: 😄

Full text of the ad context: 😅

Full video of the ad context: ❌

Feature Inclusion Loop

An example loop for feature inclusion is described above. The main message is that both predictive performance and runtime performance have to be taken into account when adding a new feature or feature transformation to the serving layer. This requires back testing new features and then evaluating them on live traffic before using them in production.
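The loop can be sketched as a small decision function. Everything in it is an assumption made for illustration: the `FeatureCandidate` fields, the thresholds, and the sample numbers are hypothetical and are not measurements from the pipeline.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureCandidate:
    name: str
    added_latency_ms: float                    # extra serving time per prediction
    backtest_logloss_gain: float               # baseline minus candidate, offline
    live_logloss_gain: Optional[float] = None  # same on live traffic; None until tested

def inclusion_decision(candidate, latency_headroom_ms=2.0, min_gain=0.001):
    """Walk a candidate through back test, runtime check, then live test."""
    if candidate.backtest_logloss_gain < min_gain:
        return "reject: no back-test gain"
    if candidate.added_latency_ms > latency_headroom_ms:
        return "reject: too slow to serve"
    if candidate.live_logloss_gain is None:
        return "run live test"
    if candidate.live_logloss_gain < min_gain:
        return "reject: no live gain"
    return "deploy"

# Hypothetical numbers only, chosen to mirror the three cases listed earlier.
candidates = [
    FeatureCandidate("integer_id", 0.01, 0.004, 0.003),
    FeatureCandidate("full_ad_text", 1.5, 0.006),
    FeatureCandidate("full_ad_video", 80.0, 0.010),
]
decisions = {c.name: inclusion_decision(c) for c in candidates}
```

Note that the video feature is rejected despite having the largest offline gain: a feature that cannot be served within the latency budget never reaches the live test.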

Live Model Performance Monitoring, Alerting, and Safeguards

Monitoring of ML outputs requires sophisticated methods and is usually ignored in traditional ML research. However, in real business situations, KPIs and model health are the first aspects that need to be tracked efficiently, and at the right level of aggregation/abstraction, while allowing deeper investigation for transient issues that are common in large scale distributed systems like ours.

The main lesson learned here is that different channels are needed for different stakeholders:

For Business metrics such as CTR, delivery rates and other business level KPIs, the product and engineering managers need high level aggregated dashboards that show trends and overall behavior of the product results.

For Model-specific metrics such as prediction bias, train/test/validation log-loss, model parameter ranges and other metrics related to the fine-grained health of the ML pipeline operation, the data science team members and prod engineering owners need tools and dashboards that allow them to dig into the details of the pipeline. This can be achieved with a mix of low level data analysis tools that allow for product unit level analysis.
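Two of the model-specific metrics named above can be computed in a few lines. The sketch below assumes prediction bias is defined as the ratio of mean predicted CTR to observed CTR, which is one common convention and not necessarily the one used in the pipeline. The alert band and the sample data are likewise hypothetical.

```python
import math

def log_loss(labels, predictions, eps=1e-15):
    """Mean negative log-likelihood of binary labels under the predictions."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def prediction_bias(labels, predictions):
    """Mean predicted CTR divided by observed CTR; 1.0 means calibrated on average."""
    observed = sum(labels) / len(labels)
    if observed == 0:
        raise ValueError("no clicks in this sample; bias is undefined")
    return (sum(predictions) / len(predictions)) / observed

labels = [0, 0, 1, 0]
predictions = [0.1, 0.2, 0.7, 0.2]
loss = log_loss(labels, predictions)
bias = prediction_bias(labels, predictions)
# Hypothetical alert band: flag the model if it over- or under-predicts too much.
alert = not (0.8 <= bias <= 1.25)
```

Computed per campaign, per publisher, or per hour, the same two functions serve as the drill-down view, while their global averages feed the aggregated dashboards.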

Conclusion

This post is a brief overview of the details of an ML pipeline that we contributed to at AppNexus. The full details are under a second-round review and should be published in a PMLR paper later in 2018. The paper contains other details and solutions related to this pipeline that can help future ML practitioners design their ML pipelines, including some of the lessons learned that we found useful for our work.

This work was presented at PAPIs Europe 2018.

View the video recording here.

Acknowledgements:

We would like to thank Abraham Greenstein, Anna Gunther, Catherine Williams, Chinmay Nerurkar, Lei Hu, Megan Arend, Paul Khuong, Ron Lissack, Sundar Nathikudi, and Noah Stebbins for being active participants throughout the process of discussing, designing, developing, and implementing the AppNexus Click Prediction Machine Learning Pipeline.