Voluminous data collection is changing the management of organizations. (Photo by Steve Castillo)

The explosion in digital data allows managers to measure and know radically more about their businesses, which they can then translate into improved products or entirely new ones. Stanford Graduate School of Business economics professor Susan Athey, who consults for Microsoft and other technology companies, discussed with Stanford Business how voluminous data collection, cheap storage, and machine learning from data troves are changing the management of organizations — not just internet-driven newcomers but also traditional businesses. Here are excerpts from the interview.

How do you see Big Data technology reshaping management skills in Silicon Valley and beyond?

At the internet-related firms, most of which have a significant presence in Silicon Valley, there is an enormous demand for new and different skill sets created by big data. Clearly, there is a need for large-scale data analytics. Within analytics, there are people who write the code to pull data from very large data sources and aggregate it into a form that's more useful. There are people who do fairly simple statistical analysis of that data, and then there are people who do much more complex statistical analysis involving machine learning or econometrics, such as modeling that predicts which link on a web page is going to get clicked on or which products consumers should be offered when they come to a web page. At high-tech firms, managers need to be able to understand and evaluate the output from these analysts, and they often find that if they cannot directly engage in it themselves, they are left out or left behind. Beyond analyzing data that is created in the regular course of business, a new domain involves managing large-scale experimentation platforms and analyzing data from experiments.

I've heard that Google specialists run thousands of experiments a year and that the data from those experiments determine company direction more than human managers. In fact, some data specialists at internet companies refer dismissively to ideas and intuition that are not grounded in data as HiPPOs, short for "highest paid person's opinions." Is that an accurate characterization?

Management clearly is changing. At companies like Google and Microsoft, even the smallest change to a search engine algorithm goes through a mandatory experimentation process. That means that products only see the light of day if they make it through rigorous statistical scrutiny. Since the firm's core product development is operating in a way that's governed by data, people in finance and business planning can't perform their functions without evaluating studies and predictions from the statisticians and data scientists. Thus, even MBAs who might be delegating the analysis of the data still need to consume sophisticated analyses and be able to communicate with the engineering and product teams, whose professional language is data.

In my experience, MBAs who have good business intuition but also can speak the language of the statisticians intelligently are rock stars. These are people who have a good grasp of what data can prove and what it can't, and how to use data effectively to make decisions. They know how to use data to prove a point, present the information well visually, and pull together a collection of empirical facts that all support the main conclusion. People with this set of talents are poached by other firms and quickly promoted. They're giving executive presentations and are the go-to people on any major strategic project. This is the direction to expect other industries to move in as well.

Besides running online experiments, what is the next big change coming from Big Data for business?

Real-time collection of data from monitoring devices or from user interaction with websites enables machine learning in real time, which can improve a company's performance relative to competitors. The internet firms are on the cutting edge of automated decision making rooted in machine learning, but it will be valuable elsewhere also.

Could you explain how machine decision making using large data sets works?

Think of a search engine, which is the ultimate machine-learning algorithm. The intent of the person who makes the query changes over time. If you type "Amanda Bynes" into the search box today, the search engine looks very quickly at what other Amanda Bynes searchers today clicked on first when they were presented with a web page of results. Today matters because what you want could be different from what it was yesterday.

When Michael Jackson died, for instance, there was a huge spike in internet traffic, and the search engine companies wanted to figure out, within the first 30 seconds, that they should stop sending people to general pages about the performer and start sending them instead to the latest news. By using the latest data — crowd-sourcing what you want — a search engine can be a quick learner.

All search engines try to do that, but how well they do it is a function of how fast they get the data. So Google will do it faster than Bing, because more people come to Google first. Amazon can beat out smaller retail operations. If you type "stroller" into Amazon, the algorithms figure out the best design of the web page for you personally. The algorithm uses a combination of what kind of consumer you are and what consumers like you clicked on in the very recent past.
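The click-driven, recency-sensitive ranking described above can be illustrated with a toy sketch. Everything here is hypothetical — the class name, the half-life parameter, and the exponential-decay rule are illustrative choices, nothing like a production search stack — but it captures the idea that a click from ten seconds ago should outweigh many clicks from hours ago:

```python
import time
from collections import defaultdict

class RecencyClickRanker:
    """Toy ranker: order results for a query by recent clicks,
    discounting older clicks exponentially."""

    def __init__(self, half_life_s=3600.0):
        # A click loses half its weight every half_life_s seconds.
        self.half_life_s = half_life_s
        self.clicks = defaultdict(list)  # (query, result) -> click timestamps

    def record_click(self, query, result, ts=None):
        self.clicks[(query, result)].append(
            ts if ts is not None else time.time())

    def score(self, query, result, now=None):
        now = now if now is not None else time.time()
        return sum(0.5 ** ((now - t) / self.half_life_s)
                   for t in self.clicks[(query, result)])

    def rank(self, query, candidates, now=None):
        return sorted(candidates,
                      key=lambda r: self.score(query, r, now),
                      reverse=True)
```

With a one-hour half-life, a result that drew a burst of clicks in the last few minutes will outrank one with a larger total accumulated over many hours — the Michael Jackson effect in miniature.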

When I talk with people outside of the internet companies about machine decision making, they are often not aware of how important vast quantities of data can be. Indeed, for many years, artificial intelligence researchers thought that if they understood the link structure of the internet and the structure of language, that would be enough to help people get good search results. It turned out that having a lot of data on how people behaved while searching was also crucial. Just knowing the most common things that people type after a particular three-letter sequence can be more important than a lot of semantic understanding. This implies that in industries where machine learning is crucial to the quality of the product, you would expect to see a lot of concentration, and new firms will have difficulty getting off the ground.
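The point about behavioral data beating semantic understanding can be made concrete with a minimal suggestion sketch built purely from a query log. The function names and the three-letter prefix length are illustrative, echoing the "three-letter sequence" above; no linguistic knowledge is involved, only counts of what people actually typed:

```python
from collections import Counter, defaultdict

PREFIX_LEN = 3  # the "three-letter sequence" from the interview

def build_suggester(query_log):
    """For each three-letter prefix, count the full queries users typed."""
    table = defaultdict(Counter)
    for q in query_log:
        q = q.strip().lower()
        if len(q) >= PREFIX_LEN:
            table[q[:PREFIX_LEN]][q] += 1
    return table

def suggest(table, typed, k=3):
    """Return the k most common logged queries extending what was typed."""
    typed = typed.strip().lower()
    return [q for q, _ in table[typed[:PREFIX_LEN]].most_common(k)
            if q.startswith(typed)]
```

A suggester like this knows nothing about language or the link structure of the web, yet with a large enough log it ranks completions well — which is exactly why the volume of behavioral data matters so much.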

Is that why investors sink millions into startups that make no profit but amass troves of data?

Take the case of mobile phone services. If you can get a lot of your mobile phone users to use your voice services, you will become better at voice recognition, which creates a higher-quality product for your consumers. It becomes a source of differentiation in your core product, but it's also a capability you can extend. That data is valuable for learning how to understand speech in a variety of contexts. You might imagine that companies that gather a large corpus of that data could sell it as a standalone product to other companies who are not direct competitors but who need speech recognition. All of the products that are touched by interfacing with humans will have that feature. So, if humans are trying to type text or speak or use touch or handwriting or gestures to interface with a device, then a company that has a large corpus of that input will have an advantage at understanding the input faster and better.

Voice recognition is an example of this more general phenomenon in which you see some of the internet firms integrating in a lot of directions because they want to gather more data. They don't want their competitors to have data that makes the competitors' products better, and they can also try to monetize the data in other ways, for example through personalization and better-targeted advertising.

In what other industries do you expect to see Big Data disrupt business as usual?

It will be interesting to see what domains will use data effectively sooner rather than later. You might think, for example, that getting an automated airline reservation system to change a connecting flight reservation when your first flight has been delayed would be simpler than getting a car to drive itself. With caller ID, the airline should know who you are when you call from the tarmac. Yet, many a passenger has been frustrated by the time-consuming airline phone tree encountered in that situation, and only recently have we seen real improvement in that experience. Cars, on the other hand, have demonstrated they can drive themselves safely. Another fascinating use of data sensors is to monitor the parts of a complex machine, such as a car or an airplane, to learn how to improve safety or when to replace worn parts. Medical diagnosis may also be a case where machines poring through petabytes of data might be quicker or more accurate than doctors, particularly for rare conditions, or cases where treatments have unusual side effects for particular populations of patients.

Some cases might surprise you. You might have thought it was pretty much impossible to break into something like the taxi business, since it is so highly regulated and local government is sensitive to the industry. Yet companies have succeeded in many cities, and they use real-time demand data to raise prices in times of short supply, ensuring that people who are willing to pay enough can always find a ride.

What about finding new uses for old data?

There are huge opportunities to answer important policy questions, both public policy and business policy questions, using operational or passively collected data designed for another purpose. All of the social sciences are actively mining data from social media such as Twitter, studying everything from happiness to adolescent social norms to the underpinnings of political upheaval. Both academics and the financial industry have mined sources such as Google Trends, finding that patterns of search behavior can be used to predict flu outbreaks, unemployment statistics, and stock returns. Another growing trend is for firms with access to large datasets to partner with researchers from academia, where the firm learns from the researchers' expertise while the researcher is able to answer questions that can only be studied using proprietary firm datasets. The resulting published research only produces aggregate statistics, protecting the confidentiality of the data. This is something I've done fruitfully myself, with Microsoft Research, but other companies such as eBay and Yahoo have also successfully worked with academics.

Consider a less obvious example: cities. New York City, Chicago, and some entire countries are doing large-scale data collection now. As this data becomes available, it is used for its direct purposes, such as data-driven policing or traffic- and transportation-flow management. But we're just starting to see the possibilities that can be unlocked through the secondary uses of the data, data concerning things like noise, energy use, and pollution at a very granular level. You might gather noise data to identify violations of a noise ordinance but then find you can use it to study the effect of noise on the health of children. You might gather data about taxi trips to monitor compliance with various regulations, but end up learning about commuting patterns, gaps in public transportation, and even the propensity of different types of customers to tip. I expect to see that in businesses as well. They may be passively collecting information about what their customers are doing with their cars. They may discover, along the way, patterns in how customers are using the cars that have implications for the design of transportation systems generally, for urban planning, and for how to design future cars.

It can be difficult for companies and governments to enable full utilization of the data they possess because of confidentiality and data-security issues, but more amazing uses will inevitably come. Any industry could be the next one to rethink and innovate in a dramatic way.