by Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, Nils Pohlmann

Appears in KDD 2013, August 2013, Chicago, IL. Paper (PDF). Talk (PDF)

© 2013. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive version is published at KDD ’13, August 11–14, 2013, Chicago, IL, USA. http://dx.doi.org/10.1145/2487575.2488217

Abstract

Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga, use online controlled experiments to guide product development and accelerate innovation. At Microsoft’s Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day. Running experiments at large scale requires addressing multiple challenges in three areas: cultural/organizational, engineering, and trustworthiness. On the cultural and organizational front, the larger organization needs to learn the reasons for running controlled experiments and the tradeoffs between controlled experiments and other methods of evaluating ideas. We discuss why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits. On the engineering side, we architected a highly scalable system, able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users. Classical testing and debugging techniques no longer apply when there are millions of live variants of the site, so alerts are used to identify issues rather than relying on heavy up-front testing. On the trustworthiness front, we have a high occurrence of false positives that we address, and we alert experimenters to statistical interactions between experiments. The Bing Experimentation System is credited with having accelerated innovation and increased annual revenues by hundreds of millions of dollars, by allowing us to find and focus on key ideas evaluated through thousands of controlled experiments. A 1% improvement to revenue equals $10M annually in the US, yet many ideas impact key metrics by 1% and are not well estimated a priori. The system has also identified many negative features that we avoided deploying, despite key stakeholders’ early excitement, saving us similar large amounts.
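
As a rough illustration of why false positives become a concern at this scale, the sketch below computes how many spurious "significant" results one would expect if many comparisons were each tested at a fixed significance level while no true effect exists. The 0.05 significance level, the independence of the tests, and the function names are assumptions made for this illustration; they are not taken from the paper, which describes its own approach to the problem.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected count of false positives when every null hypothesis is true.

    Each test rejects with probability alpha, so the expectation is
    num_tests * alpha (this holds even without independence).
    """
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests tests is a false positive.

    Assumes the tests are independent (an assumption of this sketch).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 200 is used only because the abstract mentions over 200 concurrent
    # experiments; treating each as a single independent test is a simplification.
    for n in (1, 20, 200):
        print(
            f"{n:>3} tests: expected false positives = "
            f"{expected_false_positives(n):.2f}, "
            f"P(at least one) = {prob_at_least_one_false_positive(n):.4f}"
        )
```

Under these assumptions, the chance of at least one spurious result grows quickly with the number of comparisons, which is consistent with the abstract's point that false positives need explicit attention when many experiments run concurrently.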

What people wrote

Paras Chopra, founder of Wingify (Visual Website Optimizer), in LinkedIn’s A/B testing group:

Ronny, this is very interesting. I’m sharing your paper with our entire team at Visual Website Optimizer. I’m sure we will have lots of things to learn from your team. 200 concurrent experiments on a single product is simply incredible.

What? Me? On the Bing blog? Pimping Bing? Yes! Bing’s cool experimentation platform — Avinash Kaushik Tweet

The following quote about this paper was approved for sharing

Microsoft’s upcoming KDD paper lovingly demonstrates Microsoft’s impressive scale at deploying controlled experiments to create an organization that thinks smart and moves smart! — Avinash Kaushik, Author of Web Analytics 2.0, Web Analytics: An Hour a Day, and Occam’s Razor blog

The following quote about this paper was approved for sharing

Awesome paper, the focus on customer experimentation—at scale—is an impressive example of customer development and evidence-based innovation. — Steve Blank, a consulting associate professor at Stanford, and Author of the Four Steps to the Epiphany and the Startup Owner’s Manual.

The following quote about this paper was approved for sharing

Deploy all software as an A/B test, assess the value of your ideas, test everything. Years of experience went into Microsoft’s KDD paper. Go read it. — Greg Linden, Entrepreneur, Blogger, Geeking with Greg

Kissing frogs and ugly babies: Bing’s experimentation platform by Kip Kniskern

Like many companies, Microsoft is making more and more use of “big data”, both for its own purposes and as business intelligence services it’s selling to other companies… It’s pretty clear that the era of big data is upon us, and the “old days” of beta testers and manual feedback are becoming a thing of the past.

…the authors laid out 3 “Testing Tenets” that they felt were instrumental to the development (and scaling) of their testing program. These simple rules were fantastic to read and 100% relevant to the world of Marketing Optimization and testing! I enjoyed them so much that I’m going to summarize them for you, with commentary on why you might want to steal them 🙂

This paper is the basis for the post on Bing’s official blog of 8/8/2013: Large Scale Experimentation at Bing

————————————————

BibTeX:

@inproceedings{Kohavi:2013:OCE:2487575.2488217,
  author = {Kohavi, Ron and Deng, Alex and Frasca, Brian and Walker, Toby and Xu, Ya and Pohlmann, Nils},
  title = {Online Controlled Experiments at Large Scale},
  booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining},
  series = {KDD '13},
  year = {2013},
  isbn = {978-1-4503-2174-7},
  location = {Chicago, Illinois, USA},
  pages = {1168--1176},
  url = {http://doi.acm.org/10.1145/2487575.2488217},
  doi = {10.1145/2487575.2488217},
  acmid = {2488217},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {a/b testing, controlled experiments, randomized experiments}
}

ACMRef: Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’13), Inderjit S. Dhillon, Yehuda Koren, Rayid Ghani, Ted E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He, Robert L. Grossman, and Ramasamy Uthurusamy (Eds.). ACM, New York, NY, USA, 1168-1176. DOI=10.1145/2487575.2488217 http://doi.acm.org/10.1145/2487575.2488217