I don’t think so. I began using this approach in 2004, so it’s been 15 years already. The power of these techniques is only getting stronger. The platforms used to run these experiments are becoming more widely available and more efficient. I expect that over time we’ll see more business decisions at more companies being made by doing experiments.

Experimentation is relatively expensive, but the gains we’re seeing are so significant that it’s usually worth it. At Pinterest, for instance, we’ve been running experiments looking at how well machine learning can detect content that goes against our community guidelines. We’ve built a new system that shows a 20% increase in the ability to detect that content. There is no way we’d have been able to get that kind of improvement without testing one approach against another, dozens and dozens of times. For example, we’ve developed new machine-learning technologies to identify and hide self-injury content, so there’s much less of this on Pinterest, and reports of this content from Pinners have decreased by 88% over the last year. If someone does report a Pin for self-harm, we’re now three times faster at removing it, which means fewer people see it. Experimentation helped us achieve that.

At Pinterest we have that debate internally quite a bit. To avoid letting experiments get in the way of a decision, we use a technique called “holdouts.” Let’s say your gut instinct is that a certain change to the site is going to have positive effects, and you feel very confident of that. Instead of waiting two weeks to do an A/B test, sometimes we’ll make the change right now for 99% of users but not make the change for 1%. Then we test to make sure the experience for the 99% really is exceeding the experience of the small group we’ve held out. That allows us to make a change immediately but still test to be sure our instincts are right.
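The holdout idea can be sketched in a few lines. What follows is a minimal illustration, not Pinterest's actual system: the hashing scheme, the function names (`bucket`, `in_holdout`, `relative_lift`), and the metric values are all assumptions made for the example. It assigns each user to a stable bucket so that roughly 1% keep the old experience, then compares a metric between the two groups.

```python
import hashlib

# Assumption for illustration: a 1% holdout, matching the 99%/1% split in the text.
HOLDOUT_PERCENT = 1


def bucket(user_id: str, change_name: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for one named change."""
    key = f"{change_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def in_holdout(user_id: str, change_name: str,
               holdout_percent: int = HOLDOUT_PERCENT) -> bool:
    """Return True if this user should keep the old experience."""
    return bucket(user_id, change_name) < holdout_percent


def relative_lift(launched_rate: float, holdout_rate: float) -> float:
    """Relative difference between the launched group and the holdout."""
    if holdout_rate <= 0:
        raise ValueError("holdout_rate must be positive")
    return (launched_rate - holdout_rate) / holdout_rate


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(100_000)]
    held_out = sum(in_holdout(u, "new-layout") for u in users)
    print(f"held out {held_out} of {len(users)} users")
    # Hypothetical metric values, only to show the comparison step.
    print(f"lift: {relative_lift(0.052, 0.050):.1%}")
```

Hashing the user ID together with the name of the change keeps each person in the same group on every visit, and it gives different changes different 1% groups.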

I’m not sure we hire differently, but it does require a different kind of onboarding. Companies like Facebook, Google, and Pinterest are famous for their long onboarding processes. I have friends at Facebook, where new employees spend two full weeks going through data training so that everyone understands what kind of data is available, how to access it, and how to best use it to support decision-making. That kind of training requires a huge investment.

At Walmart, people still talk about Store Number Eight, which was the location Sam Walton used when he wanted to experiment with some new approach. The practice of selecting a small number of locations where you try out new ideas continues: Walmart has approximately 10 stores it designates for experiments, with at least one in each region. Experiments typically involve things like floor layout or interactive devices. When I was at the company we experimented with a store that had only self-checkout aisles and no cashiers. As you can imagine, in a physical store experimentation is much slower than it is in a digital environment. Walmart’s culture is also affected by its merchants, who have so much experience that they sometimes rely more on instinct than on data. That instinctual approach can be successful up to a point. But especially when you’re operating at scale and launching thousands of new products each day, as we did at Walmart.com, one person’s ability to understand every new item coming into a category and at what volume it’s going to sell in each region is limited. That task is better left to a computer.

The number one issue at those firms is that people don’t have access to the data. Organizations like to talk about data democracy, but there are barriers, such as privacy concerns. I get lots of questions from people in all kinds of industries who are skeptical: “Should I really allow the entire company to see all this data?” Data democracy requires an investment and a cultural shift, but the benefits you get from letting more of your company have access to your data are significant, because it unlocks better decision-making.

When I joined Pinterest, what impressed me most was that 65% of employees there had done a query in its big-data system in the previous six days. They included not just product engineers and executives but people in HR and on the legal team. In our meetings, if somebody asks a question, instead of guessing at an answer, people’s typical response is to flip open their laptops and begin looking through customer transactions to try to find a data-driven answer.

To succeed at it, people have to commit to making decisions based on data. For most established companies, that requires transformational change. In many organizations the senior person in the room, the subject matter expert, or the person with direct responsibility makes the decision unilaterally, often based on instinct. At companies that are data-driven, you are much less likely to hear someone say, “My guess is…” or “I bet that….”

King: I worked at eBay from 2001 to 2008, and during the second half of my tenure there experimentation platforms and A/B testing became a focal point. In the early days a lot of experiments focused on the company’s search engine. When I worked at eBay, there were 100 million items for sale on the site, and we were constantly trying to optimize what users were shown when they searched on various terms. The goal was to keep it relevant but also to create serendipity. On platforms such as eBay, Etsy, and Pinterest, you don’t want the search function to be too precise. You want to encourage exploration so that people will roam around and discover new things. So we would experiment with different search results and measure things like transactions, click rates, and exploration time to try to get the best mix.

Jeremy King has worked in technology for nearly three decades—and has spent much of the past 15 years helping companies use experimentation and data to improve decision-making. Now the senior vice president for technology at Pinterest, King spoke with HBR about the benefits of the experimental approach and the kind of culture that’s necessary to support it. Here are edited excerpts of that conversation:

Idea in Brief

The Opportunity

In an increasingly digital world, randomized, controlled A/B experiments are an extremely valuable way to create or improve online experiences.

The Obstacle

Culture, not tools and technology, prevents companies from conducting the hundreds, even thousands, of tests they should be doing annually and then applying the results.

The Remedy

Create an environment in which curiosity is nurtured, data trumps opinion, anyone can conduct a test, all experiments are done ethically, and managers embrace a new model of leadership.

In December 2017, just before the busy holiday travel season, Booking.com’s director of design proposed a radical experiment: testing an entirely new layout for the company’s home page. Instead of offering lots of options for hotels, vacation rentals, and travel deals, as the existing home page did, the new one would just feature a small window asking where the customer was going, the dates, and the number of people in the party, and present three simple options: “accommodations,” “flights,” and “rental cars.” All the content and design elements—pictures, text, buttons, and messages—that Booking.com had spent years optimizing would be eliminated.

Gillian Tans, Booking.com’s CEO at the time, was skeptical. She worried that the change would cause confusion among the company’s loyal customers. Lukas Vermeer, then the head of the firm’s core experimentation team, bet a bottle of champagne that the test would “tank”—meaning it would drive down the company’s critical performance metric: customer conversion, or how many website visitors made a booking. Given that pessimism, why didn’t senior management just veto the trial? Because doing so would have violated one of Booking.com’s core tenets: Anyone at the company can test anything—without management’s permission.

Booking.com runs more than 1,000 rigorous tests simultaneously and, by my estimates, more than 25,000 tests a year. At any given time, quadrillions (millions of billions) of landing-page permutations are live, meaning two customers in the same location are unlikely to see the same version. All this experimentation has helped transform the company from a small Dutch start-up to the world’s largest online accommodation platform in less than two decades.

Booking.com isn’t the only firm to discover the power of online experiments. Digital giants such as Amazon, Facebook, Google, and Microsoft have found them to be a game changer when it comes to marketing and innovation. They’ve helped Microsoft’s Bing unit, for instance, make dozens of monthly improvements, which collectively have boosted revenue per search by 10% to 25% a year. (See “The Surprising Power of Online Experiments,” HBR, September–October 2017.) Firms without digital roots—including FedEx, State Farm, and H&M—have also embraced online testing, using it to identify the best digital touchpoints, design choices, discounts, and product recommendations.

“In an increasingly digital world, if you don’t do large-scale experimentation, in the long term (and in many industries the short term) you’re dead,” Mark Okerstrom, the CEO of Expedia Group, told me. “At any one time we’re running hundreds, if not thousands, of concurrent experiments, involving millions of visitors. Because of this, we don’t have to guess what customers want; we have the ability to run the most massive ‘customer surveys’ that exist, again and again, to have them tell us what they want.”

But in studying more than a dozen organizations and analyzing anonymized data on experiments from upwards of 1,000 others, I have seen that Booking.com, Expedia, and their ilk are the exception. Instead of running hundreds or thousands of online tests a year, many firms run no more than a few dozen, and those have little impact.

If testing is so valuable, why don’t companies do it more? After examining this question for several years, I can tell you that the central reason is culture. As companies try to scale up their online experimentation capacity, they often find that the obstacles are not tools and technology but shared behaviors, beliefs, and values. For every experiment that succeeds, nearly 10 don’t—and in the eyes of many organizations that emphasize efficiency, predictability, and “winning,” those failures are wasteful.

To successfully innovate, companies need to make experimentation an integral part of everyday life—even when budgets are tight. That means creating an environment where employees’ curiosity is nurtured, data trumps opinion, anyone (not just people in R&D) can conduct or commission a test, all experiments are done ethically, and managers embrace a new model of leadership. In this article, I’ll look at several companies that have managed to do those things well, focusing in particular on Booking.com, which has one of the strongest cultures of experimentation I have found.

Cultivate Curiosity

Everyone in the organization, from the leadership on down, needs to value surprises, despite the difficulty of assigning a dollar figure to them and the impossibility of predicting when and how often they’ll occur. When firms adopt this mindset, curiosity will prevail and people will see failures not as costly mistakes but as opportunities for learning.

A classic example concerns an incident at Amazon involving a revision of Air Patriots, a game for mobile devices in which players defend towers from attack with a squadron of planes. When Amazon launched a new version of it, the development team was taken aback by the response: The seven-day user-retention rate dropped by an astonishing 70%, and revenue fell 30%. The team discovered that it had inadvertently increased the game’s difficulty by about 10%. Amazon quickly shipped a fix, but the developers wondered if making the game easier could produce large gains in retention and revenue. To find out, they ran a test with four new levels of difficulty, in addition to a control, and learned that the easiest variant did the best. After some further refinements, Amazon launched a new version—and this time users played 20% longer and revenue increased by 20%. An accident had led to a surprising insight, which became the starting point for new experiments.

Unfortunately, this kind of reaction is an anomaly. At many companies the risk associated with experiments makes managers reluctant to allocate resources to them. But the gains enjoyed by companies that have made the leap of faith should give others the courage to follow them.

Many organizations are also too conservative about the nature and amount of experimentation. Overemphasizing the importance of successful experiments may encourage employees to focus on familiar solutions or those that they already know will work and avoid testing ideas that they fear might fail. And it’s actually less risky to run a large number of experiments than a small number. At Booking.com, only about 10% of experiments generate positive results—meaning that “B,” a modification that attempts to improve something (sales, repeat usage, click-through rates, or the time users spend on the site, for example), performs better among randomly assigned users than “A,” the control, which is the status quo. (In addition to A/B tests, Booking.com also runs more-complex tests that assess more than one modification at the same time.) But when you conduct a large volume of experiments, a low success rate still translates into a significant number of successes, which, in turn, diminish the financial and emotional costs of the failures. If a company does only a handful of experiments a year, it may have only one success or, if it’s unlucky, none. Then failure is a big deal.
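The arithmetic behind that claim is simple to check. This sketch assumes that every experiment succeeds independently with the same 10% probability, which is a simplification of the Booking.com figure quoted above; the function names are illustrative.

```python
def prob_no_success(num_experiments: int, success_rate: float = 0.10) -> float:
    """Chance that every one of the experiments fails, assuming independence."""
    return (1.0 - success_rate) ** num_experiments


def expected_successes(num_experiments: int, success_rate: float = 0.10) -> float:
    """Average number of winning experiments at the given success rate."""
    return num_experiments * success_rate


if __name__ == "__main__":
    for n in (5, 10, 100, 1000):
        print(
            f"{n:>5} experiments: "
            f"expected wins = {expected_successes(n):.1f}, "
            f"chance of zero wins = {prob_no_success(n):.3g}"
        )
```

With five experiments a year, the chance of ending up with no successes at all is about 59% (0.9 to the fifth power). With 100 it is effectively zero, and the expected number of wins is 10.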

At the companies I studied, the success rate for ideas tested early in the development of a brand-new offering is even lower. Early failures, however, allow developers to quickly eliminate unfavorable options and refocus their efforts on more-promising alternatives.

In experimental cultures, employees are undaunted by the possibility of failure. “The people who thrive here are curious, open-minded, eager to learn and figure things out, and OK with being proven wrong,” said Vermeer, who now oversees all testing at Booking.com. The firm’s recruiters look for such people, and to make sure they’re empowered to follow their instincts, the company puts new hires through a rigorous onboarding process, which includes experimentation training, and then gives them access to all testing tools.

Insist That Data Trump Opinions

The empirical results of online experiments must prevail when they clash with strong opinions, no matter whose opinions they are. This is the attitude at Booking.com, but it’s rare among most firms for an understandable reason: human nature. We tend to happily accept “good” results that confirm our biases but challenge and thoroughly investigate “bad” results that go against our assumptions.

The remedy is to implement, with few exceptions, the changes that experiments validate. As one director at Booking.com told me, “If the test tells you that the header of the website should be pink, then it should be pink. You always follow the test.”

Getting executives in the top ranks to abide by this rule isn’t easy. (As the American writer Upton Sinclair once quipped, “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”) But it’s vital that they do: Nothing stalls innovation faster than a so-called HiPPO—highest-paid person’s opinion.

Note that I’m not saying that all management decisions can or should be based on online experiments. Some things are very hard, if not impossible, to conduct tests on—for example, strategic calls on whether to acquire a company.

But if everything that can be tested online is tested, experiments can become instrumental to management decisions and fuel healthy debates. Sometimes, those discussions might result in a conscious choice to overrule the data. That’s what happened with one decision involving a comedy series at Netflix, which has built a sophisticated infrastructure for large-scale experimentation. According to a Wall Street Journal article published in 2018, the company’s executives were torn when tests showed that a promotion featuring an image of only Lily Tomlin, one of the stars of Grace and Frankie, resulted in more clicks by potential viewers than promotions featuring both Tomlin and her costar, Jane Fonda. The content team worried that excluding Fonda would alienate the actress and possibly violate her contract. After heated debates that pitted empirical evidence against “strategic considerations,” Netflix chose to use images that included both actresses, even though customer data didn’t support the decision. However, the experimental evidence made the trade-offs more transparent.

Democratize Experimentation

As I’ve noted, any employee at Booking.com can launch an experiment on millions of customers without management’s permission. About 75% of its 1,800 technology and product staffers actively use the company’s experimentation platform. Standard templates allow them to set up tests with minimal effort, and processes like user recruitment, randomization, the recording of visitors’ behavior, and reporting are automated. A core experimentation team and five satellite teams used to provide training and support to the whole organization, but because the firm’s needs evolved, that structure was recently changed to four central teams that report to Vermeer, plus specialists (“ambassadors”) who are placed in product teams.

To get things rolling, individuals or teams fill out an electronic form, which is visible to all and includes the name of the experiment, its purpose, the main beneficiaries (customers or suppliers), related past experiments, and the number of modifications to be tried out in A/B, A/B/C, or A/B/n tests. Once an experiment is up and running, the team watches it closely for the first few hours; if its primary or secondary metrics tank quickly, the team can stop the test. After that initial period, the platform continues to automatically run data-quality checks and sends warning messages if something is odd. To encourage openness, Booking.com maintains a central searchable repository of past experiments, with full descriptions of successes, failures, iterations, and final decisions. And everyone can see the real-time information generated by ongoing experiments.
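The registration form and the early-stop check described above could be modeled roughly as follows. This is a hypothetical sketch, not Booking.com's platform: the class name, the field names, and the 20% drop threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentForm:
    """Fields mirror the form described in the text; the names are illustrative."""
    name: str
    purpose: str
    beneficiaries: str                      # "customers" or "suppliers"
    related_experiments: List[str] = field(default_factory=list)
    num_modifications: int = 1              # variants tried besides the control

    def design_label(self) -> str:
        """Label the test design by how many modifications it compares."""
        if self.num_modifications == 1:
            return "A/B"
        if self.num_modifications == 2:
            return "A/B/C"
        return "A/B/n"


def should_stop(control_rate: float, variant_rate: float,
                max_relative_drop: float = 0.20) -> bool:
    """Flag a test whose metric has fallen too far below the control.

    The 20% threshold is an assumption, not a figure from the article.
    """
    if control_rate <= 0:
        return False
    drop = (control_rate - variant_rate) / control_rate
    return drop > max_relative_drop


if __name__ == "__main__":
    form = ExperimentForm(
        name="simplified-home-page",
        purpose="Test a shorter search form",
        beneficiaries="customers",
    )
    print(form.design_label())
    print(should_stop(control_rate=0.05, variant_rate=0.03))
```

A real platform would also account for sample size before stopping a test, since a metric can swing widely in the first few hours.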

“Somewhat ironically, the centralizing of our experimentation infrastructure is what makes our organizational decentralization possible,” Vermeer explained to me. “Everyone uses the same tools. This fosters trust in each other’s data and enables discussion and accountability. While some companies, like Microsoft, Facebook, and Google, may be more technically advanced in areas like machine learning, our use of simple A/B tests makes us more successful in getting all people involved; we have democratized testing throughout the organization.”

Democratization, of course, has its challenges. One is the risk that teams or individuals could break something on Booking.com’s high-traffic website, causing it to crash. Another is that each team has to set its own direction and figure out which user problems it wants to solve. That requires extensive training and ongoing discussions among team members about what the right problems are. Debates are encouraged, and people reach out to colleagues if they see anything that strikes them as questionable. Just as anyone can launch an experiment, anybody can stop one. However, this happens only on the rare occasion when an experiment has gone catastrophically awry—for example, if someone is alone in the office at night and sees that an experiment is causing a key metric like the customer conversion rate to plunge and will cost the company millions of dollars in revenues if it continues.

This system gives teams the autonomy they need to try out new approaches they believe are valuable and allows people throughout the company to monitor the experiments and provide feedback in real time. It truly liberates everyone to test any idea about how to improve Booking.com’s business.

Be Ethically Sensitive

When contemplating new experiments, companies must think carefully about whether users would consider the tests to be unethical. While the answer isn’t always clear-cut, organizations that fail to examine this question risk sparking a backlash. Take the weeklong experiment that Facebook ran in 2012 to learn whether emotional states were contagious on its platform. Facebook rejiggered its news feed—an algorithmically curated list of posts, stories, and activities—to see whether viewing fewer positive news stories led people to reduce their number of positive posts. The network also tested whether the reverse happened when people were exposed to fewer negative news stories. The experiment involved nearly 690,000 randomly selected users, about 310,000 of whom were unwittingly exposed to manipulated emotional expressions in their news feeds, while the rest were subjected to control conditions in which a corresponding number of randomly chosen posts were omitted.

When researchers from Facebook and Cornell University published the results in an academic journal, public outrage erupted. Facebook’s data science team had been running experiments on unsuspecting users for years without controversy, but the emotional manipulation struck a nerve. Critics raised concerns about whether the participants’ consent to Facebook’s general data-use policy sufficed; they felt the company should have made it clearer that users could opt out of testing and that data was collected for research. From a learning perspective, the experiment was a success: It found that emotional contagion existed online, though the effect was very small. But some users felt that Facebook had exploited them in the name of science.

Research suggests that companies that test new ideas before adopting them face greater customer scrutiny than competitors that implement new practices without conducting any experiments. In a published analysis of 16 studies in domains such as health care, vehicle design, and global poverty, bioethicist Michelle Meyer and her collaborators concluded that participants considered A/B tests to be more morally questionable than the universal implementation of an untested practice (A or B) on the entire population, even when both treatments were unobjectionable.

Clearly, ethics training and some kind of oversight are necessary. The challenge is conducting the latter in ways that don’t make people overly cautious or tangle them in red tape. For those precise reasons, Booking.com has shied away from imposing rules from on high about what kind of tests can be run. Instead, it encourages employees to ask whether an experiment or proposed practice would help or hurt customers. “I’d rather stay away from policing or ethical review boards,” David Vismans, Booking.com’s chief product officer, told me. “That’s not a scalable solution. You’d create a bottleneck, and testing police don’t make people feel like they’re empowered.” Instead, the company encourages debates in internal online forums that are open to all employees. The debates can be vigorous and have tackled issues like the use of techniques to persuade customers to complete transactions (for example, messages such as “Please book now or you will lose this reservation” or “Only three rooms left”). “I would rather have a community that is self-correcting,” Vismans explained.

To that end, Booking.com’s onboarding process also includes ethics training. LinkedIn, another company with a large experimentation program, takes a slightly different approach. It has created internal guidelines that state the company won’t run experiments “that are intended to deliver a negative member experience, have a goal of altering members’ moods or emotions, or override existing members’ settings or choices.”

Embrace a Different Leadership Model

By democratizing experimentation and following test results where they lead, companies can enable employees to make good decisions on their own and accelerate innovation and improvements. But if most decisions are made this way, what’s left for senior leaders to do, beyond developing the company’s strategic direction and tackling big decisions such as which acquisitions to make? There are at least four things:

Set a grand challenge that can be broken into testable hypotheses and key performance metrics.

Employees need to see how their experiments support an overall strategic goal. Say Booking.com’s senior leaders challenged employees to design the best online experience in the industry. They might expect that a superior experience would generate more customer traffic, which would attract more suppliers to Booking.com’s platform, helping expand the customer base and activity even more. To discover ways to pursue that goal, employees could devise hypotheses and related metrics—for instance, that underlining important text would increase conversion rates by making critical information easier to find, and that a “one click, no cost” cancellation option would boost user return rates without causing net hotel bookings to drop.
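A hypothesis such as "underlining important text increases conversion rates" becomes testable once it is tied to counts of visitors and bookings. One standard way to evaluate it is a two-proportion z-test, sketched below. This is a textbook method, not necessarily the one Booking.com uses, and the visitor and booking counts are made up.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Compare conversion rates of control (A) and variant (B).

    Returns the z statistic and the two-sided p-value.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("cannot test when no one or everyone converted")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10,000 visitors per group.
    z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The second hypothesis in the paragraph adds a guardrail: the same comparison would be run on net hotel bookings to confirm that the metric did not drop.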

Put in place systems, resources, and organizational designs that allow for large-scale experimentation.

Scientifically testing nearly every idea requires infrastructure: instrumentation, data pipelines, and data scientists. Several third-party tools and services make it easy to try experiments, but to scale things up, senior leaders must tightly integrate the testing capability into company processes. Doing so requires striking the right balance between centralization and decentralization.

In centralized groups, dedicated specialists such as developers, user interface designers, and data analysts can run experiments for the entire company and focus on introducing state-of-the-art methods and tools. But if testing is limited to a small group of specialists, it will be hard to scale up experimentation and change a company’s culture. In decentralized testing, firms spread specialist teams throughout different business units. While this approach expands experimentation to more parts of the organization, it can hinder knowledge sharing and lead to conflicting goals and poor coordination among specialists. Decentralization may be needed to get the broader organization involved at first, but after that, firms should turn to improving their experimentation capabilities. That’s what Booking.com did. It initially used satellite teams to spread experimentation across the company but found that they were too busy supporting users to focus on building firmwide capabilities. To address that problem and align the teams better, Booking.com recently switched to a center-of-excellence model that supports business units, standardizes the company’s approach to experimentation, and makes sure that best practices are adopted and followed.

Be a role model.

Leaders have to live by the same rules as everyone else and subject their own ideas to tests. “You can’t have an ego, thinking that you always know best,” Tans told me. “If I, as the CEO, say to someone, ‘This is what I want you to do because I think it’s good for our business,’ employees would literally look at me and say, ‘OK, that’s fine, we are going to test it and see if you are right.’” Bosses ought to display intellectual humility and be unafraid to admit, “I don’t know.” They should heed the advice of Francis Bacon, the father of the scientific method: “If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts, he shall end in certainties.”

Recognize that words alone won’t change behavior.

Ultimately, being a leader in an experiment-driven organization means letting go and empowering employees to perform their own tests—which doesn’t happen by simply telling people that they can do so. It requires a concerted effort like IBM’s.