The confusion continues till date

The growth of data has been exponential. According to an IBM report, 2.5 quintillion bytes of data are created per day. This has created a new class of professionals — data scientists. The question is, is data science another ‘hot’ job or a new form of science? In the Hollywood movie 21, six students, brilliant with numbers, make money at the blackjack tables of Las Vegas casinos by counting cards and using codes and hand gestures. Can we call them data scientists?

The ‘fourth paradigm’

Pioneering computer scientist Jim Gray argued that data science is the “fourth paradigm” of science, the other three being the empirical, theoretical and computational paradigms; his vision was laid out in the 2009 book The Fourth Paradigm. Given the volume of data handled nowadays, it certainly sounds sensible. However, data have always played a major role in scientific developments and the growth of knowledge, not just now. About two centuries ago, Charles Darwin’s theory of natural selection was largely based on observational data he collected during his voyages around the world. About 150 years ago, Gregor Mendel derived the laws of Mendelian inheritance from the data he collected from his experiments on peas. So, historically, science has been data-driven. What has changed is that, with the Internet, far more data is available now.

Statistics, according to the American Statistical Association, is the “science of learning from data”. So there is ample scope for confusing data science with statistics. Statistics is a data-driven science, but it focusses on developing theories based on data insights. In the early 1900s, William Gosset, writing under the pseudonym Student, used Guinness brewery data to develop the famous Student’s t-distribution. Was he a data scientist? Important theories of statistics were quite often developed from small data. Take an interesting example from the 1930s. A woman colleague of the legendary statistician R.A. Fisher claimed that she could tell whether tea or milk had been added first to a cup. To verify this, Fisher prepared eight cups of tea, with milk added first in four of them. The woman correctly identified six cups, three from each group. Fisher analysed the data using his newly developed exact test. Half a century later, this ‘Lady Tasting Tea’ experiment would come to be treated as one of the two supporting pillars of the randomisation analysis of experimental data. There is no doubt that statistics was primarily data-driven. In 1997, C.F. Jeff Wu gave a famous lecture entitled “Statistics = Data Science?” at the University of Michigan. The confusion somewhat continues till date.
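The arithmetic behind Fisher’s exact test for the tea experiment can be sketched in a few lines (the function and variable names below are my own, for illustration): under the null hypothesis of pure guessing, the number of milk-first cups the lady labels correctly follows a hypergeometric distribution, and the one-sided p-value is the chance of doing at least as well by luck alone.

```python
from math import comb

# Lady Tasting Tea: 8 cups, 4 with milk poured first.
# The lady picks 4 cups as "milk first"; let K be how many of her
# picks are truly milk-first. Under pure guessing, K is hypergeometric.
def p_at_least(k_observed, n_true=4, n_cups=8, n_chosen=4):
    """One-sided exact p-value: P(K >= k_observed) under pure guessing."""
    total = comb(n_cups, n_chosen)  # 70 equally likely ways to pick 4 of 8
    return sum(
        comb(n_true, k) * comb(n_cups - n_true, n_chosen - k)
        for k in range(k_observed, min(n_true, n_chosen) + 1)
    ) / total

# She got 3 of the 4 milk-first cups right:
print(p_at_least(3))  # 17/70 ≈ 0.243
# A perfect score would have been far less likely by chance:
print(p_at_least(4))  # 1/70 ≈ 0.014
```

With six of eight cups correct, the p-value of 17/70 is too large to rule out guessing; in this design, only a perfect score, with probability 1/70, would have been convincing evidence of the lady’s ability.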

Incidentally, the term data science was initially used as a substitute for computer science by Peter Naur in 1960. His book Concise Survey of Computer Methods defines data science as “the science of dealing with data.” ‘Dealing’ certainly includes cleaning, processing, storing and manipulating data, and the subsequent analysis of data.

Today, people expect a data scientist to command mathematics and algorithms, experimental design, engineering chops, and communication and management skills. A jack of all trades cannot be the master of anything. Yet people struggle to decide whether or not data science is just statistics on a high-capacity computer. More importantly, is a data scientist someone who is better at statistics than any software engineer, and better at software engineering than any statistician? Does data lead to “the end of theory”?

Small and big data

To me, data science appears to be a technology rather than a science, at least in its present form. Should we then call it data technology? A 2012 Harvard Business Review article concludes that a hybrid of data hacker, analyst, communicator and trusted adviser makes a successful data scientist. A considerable part of a data scientist’s work is data cleansing. That is surely not the description of a statistician.

With an ocean of data at hand, the scope of data science might look limitless. However, given the very nature of the expertise, software will invariably take over much of a data scientist’s work over time. For example, existing tools like Tableau have already eased the task of data visualisation.

In response to the new technological demand, statistics as a discipline did not completely surrender to the hype of handling waves of data; it has instead paved the way for developing a new set of experts. Many types of small data pose great challenges, even in this era of big data. Thankfully, statistics did not stray from the principle of theorising from data, big or small. We are possibly heading towards an era of software and algorithms. A shade of uncertainty remains with the advent of data science.

Atanu Biswas is Professor of Statistics, Indian Statistical Institute, Kolkata