

Contributing Editor Hadley Wickham is Chief Scientist at RStudio and Adjunct Professor of Statistics at Rice University. He is interested in building better tools for data science. His work includes R packages for data analysis (ggplot2, plyr, reshape2); packages that make R less frustrating (lubridate for dates, stringr for strings, httr for accessing web APIs); and packages that make it easier to do good software development in R (roxygen2, testthat, devtools, lineprof, staticdocs). He is also a writer, educator, and frequent contributor to conferences promoting more accessible and more effective data analysis. He writes:



Recently, there has been much hand-wringing about the role of statistics in data science. In this and future columns, I’ll discuss both the threat and opportunity of data science. I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant. Statistics is flourishing, yet by and large academic statistics continues to focus on problems that are not relevant to most data analyses. In this first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of data science that are typically considered to be out of bounds for statistics research.

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results. It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when presenting results you’ll discover a flaw in your model.

Statistics has a lot to say about collecting data: survey sampling and design of experiments are well-established fields backed by decades of research. Statisticians, however, have little to say about collecting and refining questions. Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.

Once the data has been collected, it needs to be tidied (or normalized) into a form that’s amenable to analysis. Organizing data into the right ‘shape’ is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data. I’ve worked on this problem for quite some time (culminating in the tidy data framework) but I’m aware of little similar work by statisticians.
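
To make ‘shape’ concrete, here is a minimal sketch in Python using only the standard library. The data are made up, and the helper name `tidy` and its parameters are hypothetical, chosen for illustration; this is not code from the tidy data framework itself. It reshapes a ‘wide’ table (one column per treatment) into a ‘long’ one in which every row is a single observation.

```python
# Hypothetical "wide" table: one row per subject, one column per treatment.
# All names and values are invented for illustration.
wide = [
    {"name": "A", "treatment_a": 4, "treatment_b": 7},
    {"name": "B", "treatment_a": 9, "treatment_b": 2},
]


def tidy(rows, id_key, variable_name, value_name):
    """Reshape wide rows into one row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on each output row
            long_rows.append(
                {id_key: row[id_key], variable_name: key, value_name: value}
            )
    return long_rows


tidy_rows = tidy(wide, id_key="name", variable_name="treatment", value_name="result")

# With one observation per row, a per-treatment summary is a plain loop.
totals = {}
for row in tidy_rows:
    totals[row["treatment"]] = totals.get(row["treatment"], 0) + row["result"]

for row in tidy_rows:
    print(row)
print(totals)
```

In the long form, adding a third treatment adds rows rather than columns, so the summary loop does not need to change.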

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling. Visualization and modelling are complementary. Visualizations surprise you, and can help refine vague questions. However, visualizations rely on human interpretation, so their ability to scale is fundamentally constrained. Models scale much better, and it’s usually possible to throw more computing at the problem. But models are constrained by their assumptions: fundamentally a model cannot surprise you. In any real analysis you will typically use both visualizations and models. Yet the vast majority of statistics research is on modelling; much less is on visualization, and less still on how to iterate between modelling and visualization to get to a good place.

The end product of an analysis is not a model: it is rhetoric. An analysis is meaningless unless it convinces someone to take action. In business, this typically means convincing senior management who have little statistical expertise. In science, it typically means convincing reviewers. Communication is not a mainstream thread of statistics research (if you attend the JSM, it’s easy to come to the conclusion that some academic statisticians couldn’t care less about the communication of results). Communication is a part of some PhD programs, but it tends to focus on professional communication (to other statisticians), not communicating with people who have substantive expertise in other domains.

In business, analyses are often not done just once, but need to be performed again and again as new data come in. These data products need to be robust in both the statistical sense (i.e. to changes in the underlying distributions/assumptions) and in the software engineering sense (i.e. to changes in the underlying technological infrastructure). This is a ripe field for research.

Statistics is a part of data science, not the whole thing. Statistics research focuses on data collection and modelling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products.

There are people in statistics doing great work in all these areas, but it’s not mainstream statistics. If you’re interested in these areas, it’s harder to get tenure, harder to get grants, and most of the ‘top’ statistics journals are unavailable to you.

Attempting to claim that data science is ‘just’ statistics makes statisticians look out of touch, and belittles the many other contributions outside of statistics.

What do you think? Let me know your thoughts at hadley@rstudio.com, or @hadleywickham.

Editor’s note: The opinions expressed are exclusively those of the columnist and do not necessarily reflect opinions of the IMS or editorial opinions of the IMS Bulletin.