Building the Contest

To run the contest, we needed several things: motivation, a problem, a dataset, people, contest management tools, and support.

Motivation

Running a contest takes effort: there has to be a good reason to do it. The AppNexus Data Science team creates machine learning products to assist online content creators (publishers) and advertisers with the billions of auctions transacted daily on our platform, and to keep our platform safe. Some of these products include:

Reserve price optimization in real-time bidding (RTB) auctions (a “reserve price” is the minimum price that a seller will accept from a bidder in an auction)

Optimally allocating impressions between guaranteed contracts and RTB auctions

Bid Price Pacing (automatically adjusting bid prices to match targets)

Discovery (identifying which publisher inventory is best to show ads)

Filtering non-human and other invalid traffic and inappropriate domains

We help publishers and advertisers get the best value and outcomes for their money, but our Data Science team can only tackle so many projects a year. So we’ve been experimenting with different ways to engage our Engineering and Product teams, to expand the use of machine learning in our products and services and share responsibility for its evaluation. Side benefits include building an internal pool of data science talent and spreading data literacy (e.g. having more people think from a data-driven point of view).

An internal contest adds a competitive element and makes the learning process more fun. It can focus on a problem that’s important to the company rather than a generic example (e.g. “cat vs. dog” classification), and it gives contestants familiarity with the tools used by the DS team.

Problem

We needed a problem where:

The problem is relevant to AppNexus, and ideas, models and algorithms for it are valuable to the company.

Data for the problem can be easily extracted, is of reasonable size (100 MB to a few GB), and can be analyzed on a participant’s company laptop.

We have metrics (and baseline benchmarks for those metrics) to compare contest submissions, so teams can be unequivocally ranked.

We chose Click-Through Rate (CTR) prediction: predicting the probability of a user click, given a set of auction features. This problem met the criteria above, and:

The Buy Side Data Science team has a lot of expertise in CTR prediction, with existing ML-based models in production.

We have recent data for train/test sets, metrics to compare submissions and benchmarks obtained by running the production models on this dataset.

There is a lot of online literature on this topic.

The problem has good pedagogic value: many different ML models can be applied to this problem, ranging from logistic regression and factorization machines to neural networks.

For evaluation, participants would be provided with historical auction data to train their model(s), then use those models to make predictions on a test dataset. These predictions would then be scored by comparing them with the actual click results.
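To make that workflow concrete, here is a minimal sketch of a baseline submission pipeline: a logistic regression model trained on the historical data, producing click probabilities for the test set. The file names, the submission format, and the one-hot feature handling are assumptions for illustration; only the “click_label” field (described in the Dataset section below) comes from the contest itself.

```python
# Minimal baseline sketch: train a logistic regression CTR model on the
# historical data and predict click probabilities for the test set.
# File names and the submission format are assumptions, not the contest's spec.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")   # historical auctions with a click_label column
test = pd.read_csv("test.csv")     # auctions to predict; labels withheld for scoring

# One-hot encode categorical features and align the two frames so they share columns.
# (Real solutions would likely handle high-cardinality categoricals more carefully.)
X_train = pd.get_dummies(train.drop(columns=["click_label"]))
X_test = pd.get_dummies(test)
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)

# Fit a simple baseline and predict the probability of a click for each auction.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["click_label"])

pd.DataFrame({"predicted_ctr": model.predict_proba(X_test)[:, 1]}).to_csv(
    "submission.csv", index=False
)
```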

Dataset

We used a sample of historical auction data from a specific advertising campaign conducted in September 2017. From it we created a training dataset of 900,000 samples and a test dataset of 100,000 samples. Each sample represented a unique auction and included numerical and categorical features describing the ad impression being auctioned, along with a “click_label” field that recorded whether the ad was clicked.
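As a quick illustration of what working with such a dataset looks like, the snippet below loads the training file, separates numerical from categorical features, and checks the class balance. The file name is an assumption; “click_label” is the field described above.

```python
# Quick look at the training data: feature types and overall click rate.
# The file name is an assumption; "click_label" is the field described above.
import pandas as pd

train = pd.read_csv("train.csv")
print(train.shape)  # expected to be roughly (900000, n_features + 1)

# Split columns into numerical and categorical features (excluding the label).
features = train.drop(columns=["click_label"])
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()
print(f"{len(numerical)} numerical features, {len(categorical)} categorical features")

# Clicks are rare in CTR data, so the class balance is worth checking up front.
print("overall click rate:", train["click_label"].mean())
```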

People

The contest was open to Appnexians at our Portland, Oregon office and was held over 6 weeks (October to December 2017). Three teams participated (a total of 9 Appnexians from about 40 employees at this location). We deliberately limited the pool of contestants so we could pilot the competition (to work out implementation details and demonstrate the value of a contest before scaling to the whole company), and help people one-on-one throughout the competition.

Contest management tools

We wanted to provide participants with a development environment where they could start their analysis without being bogged down by setup issues. We also wanted to make it very easy for teams to submit their predictions and see their scores and ranking instantly.

We used a Docker image for the development environment; this contained the datasets, Python, and Jupyter notebooks (although participants could use any programming language for their analysis, we encouraged Python and Jupyter because they’re popular tools in the AppNexus Data Science community and beyond). We extended a basic Jupyter Docker image with packages we considered relevant for our problem, including pandas, NumPy and SciPy for analysis, scikit-learn for traditional ML models, TensorFlow and Keras for deep learning models, and Matplotlib and Bokeh for data visualization. When the Docker container starts, it launches a Jupyter server: participants can start using Jupyter notebooks from the browser, and install additional Python packages or other software from the Jupyter terminal window.
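A Dockerfile along these lines would extend one of the community Jupyter images with the packages listed above. The base image tag, package list, and data paths here are illustrative assumptions, not our actual image definition.

```dockerfile
# Sketch of a dev-environment image: extend a community Jupyter image with the
# analysis, modeling, and visualization packages mentioned above.
FROM jupyter/scipy-notebook

# scipy-notebook already bundles pandas, NumPy, SciPy, scikit-learn, and Matplotlib;
# add the deep learning and visualization extras on top of it.
RUN pip install --no-cache-dir tensorflow keras bokeh

# Copy the contest datasets into the image so participants start with them in place
# (file names and destination are assumptions).
COPY train.csv test.csv /home/jovyan/data/

# The base image's default command launches the Jupyter server on container start.
```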

Kaggle offers companies the ability to run internal ML contests for a fee, or external contests with publicly exposed datasets. We searched online for freely available software that would let participants submit their predictions and see their scores and ranking immediately after submission, but didn’t find any, so we decided to create and open source our own competition platform. We used Dash, a Python framework built on React and Flask for creating analytical web applications, including web interfaces with dropdowns, sliders, and graphs, without writing any JavaScript.
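For readers unfamiliar with Dash, the toy sketch below shows the general shape of such an app: an upload component, a callback that scores the uploaded predictions against held-out labels, and a layout that displays the result. It is not our open-sourced platform; the component IDs, file paths, column names, and scoring details are all assumptions.

```python
# Toy sketch of a Dash submission app: upload a CSV of predicted click
# probabilities, score it against held-out labels, and show the result.
import base64
import io

import pandas as pd
from dash import Dash, dcc, html, Input, Output
from sklearn.metrics import log_loss

# Held-out click labels for the test set (path and column name are assumptions).
y_true = pd.read_csv("test_labels.csv")["click_label"]

app = Dash(__name__)
app.layout = html.Div([
    html.H2("CTR contest submissions"),
    dcc.Upload(id="upload", children=html.Button("Upload predictions (CSV)")),
    html.Div(id="score"),
])

@app.callback(Output("score", "children"), Input("upload", "contents"))
def score_submission(contents):
    if contents is None:
        return "No submission yet."
    # dcc.Upload delivers the file as a base64-encoded data URL.
    _, content_string = contents.split(",")
    preds = pd.read_csv(io.StringIO(base64.b64decode(content_string).decode("utf-8")))
    score = log_loss(y_true, preds["predicted_ctr"])
    return f"Log loss: {score:.5f}"

if __name__ == "__main__":
    app.run_server(debug=True)
```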

We scored each submission as soon as it was uploaded through the app, using the Logarithmic Loss (log loss) metric (a popular metric for measuring the performance of a classification model whose predictions are probability values between 0 and 1). As in Kaggle contests, each submission received two scores: a public score and a private score. The public score was the log loss computed on a fixed 20% random sample of the test dataset and was displayed immediately upon submission. The private score was the log loss computed on the full test dataset, but was kept hidden from the participants.
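In code, the two scores amount to evaluating the same metric on two index sets, roughly as follows. The file paths, column names, and random seed are assumptions; the point is that the 20% public sample is drawn once and reused, so every submission is scored on the same rows.

```python
# Sketch of the public/private scoring scheme: log loss on a fixed 20% sample
# of the test set (public) and on the full test set (private, kept hidden).
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

labels = pd.read_csv("test_labels.csv")["click_label"]   # held-out labels (path assumed)
preds = pd.read_csv("submission.csv")["predicted_ctr"]   # submitted probabilities (assumed)

# The public subset is drawn once with a fixed seed so it is identical for all teams.
rng = np.random.RandomState(0)
public_idx = rng.choice(len(labels), size=int(0.2 * len(labels)), replace=False)

public_score = log_loss(labels.iloc[public_idx], preds.iloc[public_idx])  # shown immediately
private_score = log_loss(labels, preds)                                   # kept hidden

print(f"public: {public_score:.5f}  private: {private_score:.5f}")
```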