From complex techniques only used by academic statisticians, data science has risen to extreme popularity in only a few years. Roger D. Peng, Professor of Biostatistics at Johns Hopkins University and founder of one of the largest data science online courses, helps us understand this discipline and recommends the five best books to delve into it.

Before we take a look at your selection of the best data science books, can you tell us a bit more about your background? How did you go from biostatistics to data science?

In my mind, most of what I did as a biostatistician is the same as what I do as a data scientist. Data science is a pretty big tent. It encompasses a lot of people, and that’s kind of the point. One of the reasons this new concept of ‘data science’ has appeared in recent years is that it covers a wide range of activities that many people have been doing all along. Personally, I started in biostatistics, but one of the things that really got me more into this community is the series of courses we started teaching on data science and the R language through Coursera with my colleagues Brian Caffo and Jeff Leek. From there, it’s taken off—we didn’t anticipate how much interest there would be in learning about these tools.

How many people in total have taken the online data science specialisation that you’re teaching?

In terms of the full specialisation and the final certificate, there have been around 8,000 students now, but millions of people have taken one or two courses in the entire series. This success really continues to amaze us.

Why is it that data science has proven to be such a popular subject on online learning platforms like Coursera and edX?

A lot of things converged at the same time, around 2012. Technology is making data easy to generate, but at the same time, there aren’t a lot of people who can look at this data, and have the skills or know the tools to make sense of it. At the time, there wasn’t a lot you could do to learn this online; you had to go get something formal like a Master’s degree. When Andrew Ng came along with his machine learning class, it was a revolution. Suddenly, thousands of people could enjoy his teaching for free.

That’s what got things going, and then we came along with our data science program. It was low cost, online, and it covered the whole spectrum of data science, including things like setting up a GitHub repository, building a Shiny application, and of course using machine learning. Now there are many resources out there; the landscape has changed dramatically in only a few years.

Are there insights that those millions of students have taught you about how data science is best learned?

It’s hard to make stark pronouncements in this area. Everything is changing very fast—the technology included—and the use cases for data science are growing every day. One thing we’ve learned from teaching these courses is that the heterogeneity of online students is at its absolute maximum. That said, there are a couple of things that come through. As statisticians, we emphasize statistical thinking because we think it’s very important, and it should be a part of any good data science program. And in terms of the computer science part, certain tools are also better than others; we choose to focus on R.

1 Statistical Evidence: A Likelihood Paradigm by Richard Royall

For your first book, you decided to choose Statistical Evidence: A Likelihood Paradigm, by Richard Royall. Can you introduce it for us?

Richard Royall used to be a professor at Johns Hopkins University, but he retired before I joined. This book was actually given to me when I first arrived, and it revolutionized the way I think about data analysis and statistical thinking. It’s a very small book, quick to read, but I’ve gone through it probably twenty or thirty times. Every time, I get something new out of it. It’s a little technical and on the mathematical side; you do need some statistical background to read it.

It talks about the distinction between what the data gives you and what happens when you combine the data with outside things. He explains the different inferential paradigms in statistics, including frequentist and Bayesian, and he presents this middle road that he calls ‘likelihood’. His main point is that there are things that we do that we can trace back to the data, but other inferential tools that we use only depend on our assumptions about the world. We need to separate those two things, establish what the data says, and then decide what we’re going to use it for (like making a decision, enrolling patients in a trial, etc.).

We often make decisions by combining data with outside elements, and we need to be conscious of this. Many tools try to wrap those things together and they make things very confusing. P-values are an example of this in frequentist statistics: they become confusing because they combine results coming from the data with assumptions about the world. Royall’s way of thinking was new to me, and it had a profound effect on how I approached data analysis. A lot of the discussion about data analysis tends to lump things together, but many steps have nothing to do with the data specifically—they’re ‘data-adjacent’, for lack of a better word. The role of the data analyst is important, but the role of the scientist or policy-maker is different, and we need to think of them separately.
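As an editorial illustration of the separation described above (not an example taken from Royall's book), here is a minimal Python sketch. It assumes made-up data of 7 successes in 10 trials and two hypothetical success rates, 0.7 and 0.5. The likelihood ratio depends only on the data and the two hypotheses being compared; the posterior odds additionally depend on an outside belief, here represented by hypothetical prior odds. The function names are invented for this sketch.

```python
from math import comb


def binomial_likelihood(p, successes, trials):
    """Probability of observing `successes` in `trials` if the success rate is p."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def likelihood_ratio(p_a, p_b, successes, trials):
    """How much better hypothesis A predicts the observed data than hypothesis B."""
    return binomial_likelihood(p_a, successes, trials) / binomial_likelihood(
        p_b, successes, trials
    )


def posterior_odds(prior_odds, lr):
    """Combine the evidence with an outside belief (Bayes' rule in odds form)."""
    return prior_odds * lr


# Made-up data for illustration: 7 successes in 10 trials.
lr = likelihood_ratio(0.7, 0.5, successes=7, trials=10)

# The same evidence combined with two different (hypothetical) prior beliefs.
sceptic = posterior_odds(prior_odds=0.1, lr=lr)
neutral = posterior_odds(prior_odds=1.0, lr=lr)

print(f"likelihood ratio (what the data say): {lr:.3f}")
print(f"posterior odds, sceptical prior:      {sceptic:.3f}")
print(f"posterior odds, neutral prior:        {neutral:.3f}")
```

The likelihood ratio is the same for both analysts; only the step that brings in outside beliefs makes their conclusions differ, which is the kind of distinction the answer above is pointing at.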

How much of a statistical background does the book assume?

It assumes that the reader knows statistics up to an introductory course. You don’t have to know calculus, but you should be comfortable with seeing some mathematical symbols.

2 Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau

Your second choice is quite different: it’s Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, by Nathan Yau. Can you introduce us to the author and what he does, and why you think the book is important?

Nathan Yau is a statistician, with a PhD from UCLA (University of California, Los Angeles) like me, although our time there did not coincide. He worked with Mark Hansen, who is an expert in data visualization and design thinking. Yau has a long history of thinking carefully about data visualization through his blog ‘FlowingData’. He’s written a lot about how to present data. There are different times during a typical analysis where you’ll want to visualize your data: the main ones are during the early stages when you’re exploring the data, and in the later stages when you’re presenting your findings. This book is about how best to present data to other people, the tools that you can use, and the types of visualizations that you can make.

He’s an incredible thinker in this area, very meticulous and careful. Some of the examples he presents on his website are really well-designed and thought out. And this book is a great representation of the incredible work he’s doing. One thing that you can learn from it is a process for thinking through what you’re doing, and meticulously making sure that your visualizations have the impact that you want them to have.

Data visualization rarely receives a lot of emphasis in data science programs. Students learn a whole lot about statistics and programming, and maybe spend a few hours learning how to plot data with a package like ggplot2, but rarely do courses go deep into things like visual design. Do you think this is something lacking from most data science teaching?

It is lacking in many cases, but I think there’s a reason for that: it’s not that easy to teach it. It’s difficult to automate. You have to actually look at what’s been done and decide for yourself what’s good or bad about it, but there’s no easy formula for doing that. On the other hand, when you’re developing machine-learning algorithms, there’s a fairly predictable process. For visualizations, we often present people with various tools that they can use, but it’s a more amorphous process in terms of teaching it.

3 Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Your third choice is Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic. She used to work at Google. She left to write this book and (similarly to Nathan Yau) advise people on how to tell more compelling data-related stories—which includes visualization, but also other elements.

Yes, like Nathan Yau, she’s working out there by herself, and gives short courses and does consulting on data-related matters. She also has a fantastic podcast called ‘Storytelling with Data’. One of the things she hits on is the design aspect of data analysis. Yau really focuses on visualization and presentation. Knaflic takes it a bit more broadly, and focuses on things like the audience who’s going to be on the receiving end of the analysis or report. It’s important to think in terms of what they need, and what would be best for them among the many choices you could make when analysing data.

Another important idea of hers is to develop a narrative in data presentation. When doing data analysis, you’re creating many options for yourself by creating hundreds of plots, fitting thousands of models looking at different aspects of the data. But towards the end, you have to synthesize all that into something coherent. When students learn this process, I’ve often seen them come with a 50-page printout of everything that the software produced, but in reality they’ll have to find a way to reduce that to a set of three or four pages. The way you do that is by building a narrative that goes from A to B to C to D. Once you can figure out what that story is, then you can pick the plots and tables that help you with that. Some of the references in the book even come from ‘actual’ writing, such as screenwriting. It’s an element that we often forget when teaching data science, by pretending that you’re done once you’ve understood the models and their output.

In a way, the output of a data analysis is three quarters of the way to the end. The final quarter is selecting the various things that you’ve done and building a final ‘data product’ from them. This is true even if you’re not writing a paper or presenting to the CEO. Even if you’re just turning to your neighbor or sending an email to somebody, there is a process of ‘dimension reduction’ that occurs, whereby you’re selecting among the various things that you’ve done to only present a few of them.

4 An Introduction to Statistical Learning: with Applications in R by Daniela Witten, Gareth James, Robert Tibshirani & Trevor Hastie

Your fourth choice goes back to statistics with a classic of the field: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Tell us about it.

This book is written by a powerhouse of authors in the machine learning community, true authorities in the field. But beyond that, they’re also great writers. There’s another book by the same publisher called The Elements of Statistical Learning which is a bit more advanced, but this one can capture a much wider audience. If you really want to get into the guts of the models and statistics tools that are being used today, this is a great reference and a great way to learn. They have a ton of code out there to go with the book, including an R package to implement the models, run the examples, etc. Of the many books that you could choose in this category, it’s really one of the better ones.

The book uses R as the fundamental language to learn about machine learning. You also teach and use R in your own courses on Coursera. Is this choice of the R language simply due to your academic background, or do you think that R has intrinsic merits over Python, the other main language used for data science today?

I’ve been using R for twenty years now. I started using it when I was in school, and I didn’t know much about Python back then; I saw it mostly as a scripting language. Now, twenty years later, it’s hard to teach an old dog new tricks! That said, I think it would have been a very different story if R had not evolved the way that it has. It has grown tremendously, and the ecosystem and the community have become huge, to the point that there are more things that you can do with it now than you could possibly learn. For the work that I do, it’s the perfect tool.

There are different phases in data science on any given project, and some tools are more suitable to some phases than others. In the first phase of exploring and looking at the data, I think that pretty much any tool is useful there; all you want is something you’re familiar with, so that you can work quickly without the tool getting in the way. But as you get to the final stages like modeling and producing the final results, you want to make sure that you can ensure things like reproducibility, consistency, and robustness. And here, Python and R are obviously two good languages.

It’s quite an in-depth book, and a bit more on the mathematical side than other similar resources, and much more math-heavy than a lot of recent courses that show you ‘off-the-shelf’ machine learning algorithms that you can use, without telling you too much about the underlying model. Do you think there is something to be said for diving deep when learning data science, and resisting the temptation to use easy, pre-packaged black boxes?

Those things aren’t necessarily mutually exclusive. Most people can actually start with some off-the-shelf algorithm and see how it performs; but that will only take you so far. As time goes on, I think you’ll quickly reach the limits of off-the-shelf solutions. Once that happens and for whatever reason your pre-packaged machine learning software isn’t doing what you need it to do, you have to know what’s going on if you want to make changes, or even do something completely new. Having an understanding of what’s going on underneath the software, and being able to make improvements, is an edge that you can carry with you in your career.

5 Design Thinking: Understanding How Designers Think and Work by Nigel Cross

Finally, you chose Design Thinking: Understanding How Designers Think and Work, a 2011 book by Nigel Cross. It’s been the focus of a recurring book club in your Not So Standard Deviations podcast with Hilary Parker. It seems to be quite a dense book, with many interesting concepts to unpack—even though it is, at least at first sight, not related to data science.

The five books that I chose are probably at quite a high level of abstraction in terms of data science. I could have chosen a bunch of books about detailed statistics, R programming, etc. but I purposely chose to go a bit higher up, and this book might be the epitome of this choice. Being a person who’s done a lot of data analysis, one thing that I’ve found frustrating is the lack of a proper mental model for what is going on when you analyse data. Most universities have a class called ‘data analysis’, and typically it presents various useful tools, but rarely discusses what actually happens when you do the analysis itself. Design Thinking gives a mental model to describe what happens in this process.

Yes, it is about design in general, not specifically about data analysis, but there’s a lot that we can borrow from that world to adapt those concepts to our needs. They provide a way of thinking and a vocabulary. Even the best data analysts I’ve worked with have trouble saying what they’re doing—they just do it. Every data analysis feels unique, so it’s very hard to generalize across different experiences.

In the podcast you actually mention this as one of your first thoughts when you came across the book—this idea that rather than reinventing the wheel by creating new concepts and vocabulary to describe data science processes, one could simply look at what has already been conceptualized in design, sometimes decades ago.

Exactly; a colleague of mine often says that there’s a reason it’s called research: you’re rarely inventing something completely new on your own, but rather you’re always borrowing and improving from what someone else conceptualized. I really think it’s a great book, even if you have no interest in design per se. It doesn’t take much of a mental leap to see how what Nigel Cross describes is relevant to what we’re doing in data science.

Do you think that the ability to think carefully about design is one of the higher-level skills that can take you from ‘data script kiddie’ to somebody who can truly bring insights to a real-world problem?

Yes, it’s a key step. If you take the example of the students pursuing a PhD in biostatistics here at Johns Hopkins, they spend three years with an academic advisor, essentially learning these types of skills. It’s hard to scale as a model, because it takes a long time and really requires focused learning and deliberate practice to master efficient data analysis across different kinds of data. Of course it’s very important to know the tools, but the tools change and are more susceptible to automation. On the other hand, things like design thinking are key parts of advancing up the ladder, and to becoming a leader in this area, who is able to coordinate the activities of a group of people.

You just mentioned PhDs and the difficulty of scaling data science learning. For people who want to learn this subject, the main way to do it these days is to go on sites like Coursera, DataCamp or Kaggle, and learn on your own. Do you think there’s still a big difference between this method and a proper PhD in the quality of the teaching and advice you’ll get?

Actually, a lot of people who took our courses on Coursera already had advanced degrees, usually Master’s degrees but also PhDs. What they didn’t have was knowledge of these tools, and they needed to fill a gap, which online courses are perfect for. I think that any person who’s gone through a scientific program has really learned the same higher-level skills; maybe they just know MATLAB instead of R, for example. If you’re younger or haven’t had that kind of education yet, then these kinds of courses are tremendously useful in getting you your first job, but the option to go further down the road and follow a good program will usually be very beneficial. The other important aspect is that some programs, including Bachelor’s degrees, can be extremely expensive. So following short programs and online courses is definitely a good way to test the waters and see if you like the subject.