Building websites with science

Posted by Peter Seibel on June 21, 2012

At Etsy, we want to use data to inform what we do. We have an awesome culture of obsessing over graphs and building dashboards to track our progress. But beyond tracking the effects of our changes from minute to minute via graphs on big screens—which is a great way to see if a change has had some dramatic effect on, say, performance or error rates—there’s another, slightly slower paced way to use data: we can gather data over days or even weeks and then analyze it to determine whether the changes we have made to the web site have had the effects we expected them to.

There are different levels of awareness of data that different companies and teams within one company can have. Here at Etsy our search team, perhaps because of the inherently data-centric nature of their product, has been one of the leading users of data-driven development but the techniques they use can be applied to almost any kind of product development and we’re working to spread them throughout the company. In this post I’ll describe some of the different levels and finish up with some caveats about using science in product development.

Data? What’s that?

If you are building websites in the least scientific way possible your workflow probably goes like this: Think of a change we might make to the website (“I think I’ll change the background color on the Valentine’s Day page from white to pink”). This may be based on a theory about why that change makes sense (“It’ll put people in a more festive mood and they’ll buy more stuff”) or it may just “feel right”. In either case, you make the change and move on to your next idea. Unfortunately, people often disagree about whether something’s a good idea—everybody has an opinion and without any data there’s no way to escape the realm of pure opinion. The mark of a group operating at this level of no data use is the large amount of time people spend arguing about whether changes are worthwhile or not, both before and after the changes are made.

You know, we’ve got these web logs

Of course nobody really builds websites with quite this level of obliviousness since no one can resist the temptation to look at web logs and make up a story about what’s going on. A team operating at the next level of data awareness would think up an idea (pink background), make the change, and then look at the web logs to see how many people are buying things from the new, pink Valentine’s Day page. This is an improvement over the first level—if things change, at least you’ll know—but not much of one. The problem is that even though the data may tell you whether things are getting better or worse there’s almost no way to know whether the change is due to the pink background or some other factor. Maybe it’s getting closer to Valentine’s Day so people are more likely to buy Valentine’s Day related products. Or maybe over time, the user base has shifted to include people who are more likely to actually buy things. In other words there are any of a number of “confounding factors” that may be causing sales to go up or down, independent of the changed background. The mark of a team at this level is that the same amount of time is spent arguing but now the arguments are about how to interpret the data. Folks who think the change was good will argue that improving sales are the result of the change and declining sales due to something else while folks who don’t like the change will argue the opposite. But the arguments take just as long and are just as inconclusive.

Hey kids, let’s do “science”!

The obvious next step, for many people, is to try to control for the many confounding factors that result from the passage of time. Unfortunately there are a lot of wrong ways to do that. For instance, you might say, “Let’s use the pink background only on Thursdays and then compare the purchase rate on Thursday to that of rest of the week.” While this is a step in the right direction, and is probably easy to implement, it is not as useful as you might think. While you have controlled for some confounding factors, there are still many others that could still be intruding. Any difference that could possibly exist between the folks who visit on Thursday and those who come during the rest of the week could provide an alternate explanation of any difference in purchase rates. Just because there’s no obvious reason why Thursday visitors would be richer or more profligate or just more into Valentine’s Day than the average user, they might be and if they are it will confound the results. At this stage there are likely to be fewer useless arguments but there is a quite high chance of making bad decisions because the data don’t really say what folks think they say.

That’s so random

Luckily there’s one simple technique that neatly solves this problem. Instead of dividing visitors by what day they visit the site or some other seemingly arbitrary criteria, divide them by the one criteria which, by definition, is not linked to anything else: pure randomness. If every user is assigned to either the pink or white background at random, both groups should be representative of your total user base in all the respects you can think of and all the ones you can’t. If your total user base is 80% women, both groups will be approximately 80% women. If your user base has 1% of people who hate the color pink, so will both your experimental groups. And if your Thursday users, or any other subset, are for whatever reason more likely to buy than the average user, they will now be distributed evenly between both the pink and white background groups.

Another way to look at it is that if you observe a difference in purchasing rate between two randomly assigned groups then you only have to consider two possibilities: that the difference is actually due to the background color or that it’s due to random chance. That is, just as it’s possible to flip a coin ten times and get eight heads and two tails, it’s possible that the random assignment put a disproportionate number of big spenders in one group and that’s why the purchasing rates differ. However statistics provides tools that can tell you how likely it is that an observed difference is due to chance. For example, statistics can tell you something like: “If the background color in fact had no effect on purchasing rate, there’s a 0.002% chance that you would have seen the difference you did due to the random assignment of users to the two groups.” If the chance is small enough, then you can be quite confident that the only other possible explanation, that the background color had an effect, is true.

This is what data geeks mean when we say something is statistically significant: results are statistically significant when the chance of having achieved them by pure chance is sufficiently small. How small is sufficiently small is not fixed. One common threshold is 95% confidence which means that the chance of getting the results by chance has to be less than 5%. But note that at that threshold, one out of every twenty experiments where the change actually has no effect, will get a result that passes the threshold for statistical significance leading you to think the change did have an effect. This is called a false positive. You can control how many false positives you get by how you choose your threshold. With a higher confidence level, say 99%, you will have fewer false positives but will increase your rate of false negatives—times when the change did have an effect but the measured difference isn’t great enough to be considered statistically significant. In addition to the confidence level, the likelihood of a false negative depends on the size of the actual effect (the larger the effect, the less likely it is to be mistaken for random fluctuation) and the number of data points you have (more is better).

There are, however, two things to note about statistical significance. One is that there is no law that says what confidence level you have to use. For many changes 95% might be a good default. But imagine a change that is going to have a significant operational cost if rolled out to all users, maybe a change that will require expensive new database hardware. In such a case you might want to be 99% or even 99.9% confident that any observed improvement is real before deciding to turn on the feature for everyone.

The second thing to keep in mind is that statistical significance is not the same as practical significance. With enough data you can find statistically significant evidence for tiny improvements. But that doesn’t mean the tiny improvements are practically useful. Again, consider a feature that will cost real money to deploy. You may have highly statistically significant evidence that a change gives somewhere between a .0001% and .00015% improvement in conversion rate, but if even the top end of such a small change won’t produce enough new revenue to outweigh the cost of the feature you probably shouldn’t roll it out.

Any group that understands these issues and is running randomized tests is truly using data to understand the effects of their changes and to guide their decisions. The marks of a team operating at this level are that at least some people spend their time arguing about statistical arcana—what is the proper alpha level, Bayesian vs frequentist models, and exactly what statistical techniques to use for hypothesis testing—as well as coming up with more sophisticated statistical analyses to glean more knowledge from the data gathered.

Real Science™

Note, however, that no matter how fancy the analysis, the only knowledge built is about the effect of each particular change. Each new change requires starting all over—coming up with an idea, testing it with randomly assigned groups of users, and analyzing the data to see if it made things better or worse.

Which may not be too bad. If you have reasonably good intuition about what kind of changes are likely to be improvements, and you aren’t running out of ideas for things to do, you may be able to get pretty far with that level of data-driven development. But just as using data to decide what changes are actually successful is often better than relying on people’s intuition, you can also use data to help improving people’s intuition by by providing them with actual knowledge about what kinds of things generally do and do not work. To do this you need to follow the classical cycle of the scientific method:

Ask a question Formulate a hypothesis Use the hypothesis to make a prediction Run experiments to test the prediction Analyze the results, refining or rejecting the hypothesis Communicate the results Repeat

For instance, you might ask a question like, what kind of images make people more likely to purchase things on my site? One hypothesis might be that pictures of people, as opposed to pictures of inanimate objects, make visitors more likely to purchase. An obvious prediction to make would be, adding a picture of a person to a page will increase the conversion rate for users who visit that page. And the prediction is easy to test: pick a page where it would make sense to add a picture of a person, do a random A/B test with some users seeing the person and others some other kind of picture, and measure the conversion rate for that page.

If the results of the experiment are in line with your prediction that’s good but you’re not done—next you want to replicate the result, maybe on a different kind of page. And you may also want to test other predictions that follow from the same hypothesis—does removing pictures of people from a page depress the conversion rate? You can also refine your hypothesis—perhaps the picture must include a clearly visible face. Or maybe you’ll discover that pictures of people increase sales only in certain contexts.

Whatever you discover you then need to communicate your results—write them down and make it part of your institutional knowledge about how to build your web site. It’s unlikely that you’ll end up with any equations as simple as E = mc2 or F = ma that can guide you unerringly to profitability or happier users or whatever it is you care most about. But you can still build useful knowledge. Even if you just know things like, “in these contexts pictures of people’s faces tend to increase click through” you can use that knowledge when designing new features instead of being doomed to randomly wander the landscape of possible features with only local A/B tests to tell you whether you’re going up or down.

The limits of science

Science, however, cannot answer all questions and it’s worth recognizing some of its limitations. Indeed science is neither necessary nor sufficient: Jonathan Ive at Apple, working with his hand-picked team of designers behind the tinted windows of a lab from which most Apple employees were barred, didn’t—so far as we know—use A/B testing to design the iPhone. And Google, for all their prowess at marshaling data, has not been able A/B test their way to a successful social networking product.

Here are a few of the practical problems with trying to design products purely with science:

A/B testing is a hill climbing technique If you imagine all the variants of all the features you could possibly build arranged on a landscape, with better variants at higher altitude, then A/B testing is a great way to climb toward better versions of an existing feature. But once we’re at a local maximum, every direction leads downhill. The only way off a local maximum is to leap into space and try to land (or hope that you do) somewhere on a gradient that leads to a higher peak.

If you do leap off toward a new hill, you can’t use A/B testing to compare where you land to where you were because most likely you’ll land somewhere below the peak of the new hill and quite possibly lower than where you were on the old hill. But if the new peak is actually higher than the old one, you’d do well to stay on the new hill and use A/B testing to climb to the higher local maximum. Once we’ve made the leap, science and experiments definitely come back into play but they’re of limited use in making the leap itself.

A/B testing is only as good as our A and B ideas Another way to look at A/B testing is as a technique for comparing ideas. If neither the A nor B idea is very good, then A/B testing will just tell you which sucks less. Knowledge built with experiments, will—we hope—help designers navigate the space of possible ideas but science doesn’t say a lot about where those good ideas ultimately come from. Even in “real” science exactly how scientists come up with interesting questions and good hypothesis is a bit mysterious.

You can’t easily learn about people who aren’t using your product The basis for statistical inference is the sample. We learn about a larger population by measuring the behavior of a random sample from that population. If we want to learn about the behavior of current Etsy users we’re all set—run an experiment with a random sample of users and draw our inferences. But what if we want to know whether a new feature will expand our user base? It’s harder (though not necessarily impossible) to get a random sample of people who aren’t currently using Etsy but who might be interested in some as-yet-undeveloped Etsy feature. Consequently, there’s a real risk of looking under the streetlamp (running experiments on our current users) because the light is better, rather than looking somewhere where we might actually find the keys to a hot new car.

Finally there’s one larger structural problem with the scientific method captured by a famous paraphrase of something Max Planck, founder of quantum theory and the 1918 physics Nobel laureate said: “Science progresses one funeral at a time.” In other words, it can be difficult to let go of hard-won scientific “truths”. After you’ve gone to all the trouble to ask an interesting question, devise a clever hypothesis, and design and run the painstaking experiments that convince everyone that your hypothesis is correct, you don’t really want to hear about how your hypothesis doesn’t fit with some new bit of data and is bound to be replaced by some new theoretical hotness. Or as Arthur C. Clarke put it in his First Law:

When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.

In product development, instead of distinguished but elderly scientists we have experienced product managers. But we have the same risk of holding on too long to hard-won “scientific” truths. If we do everything right, running experiments to test all our hypothesis about how the product works and even making occasional bold leaps into new territory to escape the trap of A/B incrementalism, eventually accumulated “scientific” knowledge may become a liability. And the danger is even worse in product development than it is in “real” science since we’re not dealing with universal and timeless truths. Indeed, each product you launch (to say nothing of what the rest of the world is doing) changes the landscape into which new products will be launched. What you learned yesterday may no longer apply tomorrow, in large part because of what you do today. Properly applied, science can lead us out of such difficulties but it’s worth noting that it was science, by allowing us to know things at all, that can mislead us into thinking we know a bit more than we really do.

So what to do?

Science is powerful and it’d be silly to abandon it entirely just because there are ways it can go wrong. And it’d be equally silly to think that science can somehow replace strong, talented product designers. The goal should be to combine the two.

The strength of the lone genius designer approach (c.f. Apple) is that when you actually have a genius you get a lot of good ideas and can take giant leaps to products that would never have been arrived at by incremental revisions of anything already in the world. Furthermore, with a single designer imposing their vision on a product, it is more likely to have a coherent, holistic design. On the other hand, if you give your lone genius completely free rein then there is no protection from their occasional bad ideas.

The data-driven approach, (c.f. Google), on the other hand, is much stronger on detecting the actual quality of ideas—bad ideas are abandoned no matter who originally thought they were good and the occasional big wins are clearly identified. But, as noted above, it doesn’t have much to say about where those ideas come from. And it can fall into the trap of focusing on things that are easy to measure and test—don’t know what shade of blue to use? A/B test 41 different shades. (An actual A/B test run at Google.)

Ultimately the goal is to make great products. Great ideas from designers are a necessary ingredient. And A/B testing can definitely improve products. But best is to use both: establish a loop between good design ideas leading to good experiments leading to knowledge about the product leading to even better design ideas. And then allow designers the latitude to occasionally try things that can’t yet be justified by science or even things that my go against current “scientific” dogma.