Data science is often said to be built on three pillars: domain expertise, statistics, and programming. Hadley Wickham, Chief Scientist at RStudio and creator of many packages for the R programming language, chooses the best books to help aspiring data scientists build solid computer science fundamentals.

First let’s talk about your journey to data and computer science. Today, you’re chief scientist at RStudio, and a highly respected figure in the data science community. Yet your initial background was in human biology and statistics. What brought you to where you are today? How did you start as a programmer, and when did you become involved in data science?

It was almost a process of elimination. I started off by going straight from high school to medical school in New Zealand, and realized after a year that I didn’t want to be a doctor. I decided to go back to what I really enjoyed in high school, which was programming and maths. I had started programming quite early; my dad had a computer for work, so I started exploring things like Word, Excel, and Access, and writing little programs and macros. I spent some time in high school running databases for people, and also learned some PHP to create websites.

So I was already doing some programming when I started my bachelor’s degree, which was in statistics and computer science. I was surprised by the gap between the very theoretical aspect of a computer science degree, and what I had actually experienced when I programmed. The degree felt almost useless at the time. For example, I took a class on algorithms, in which I learned that doubling the speed of a program wasn’t particularly interesting, which I thought was a ridiculous thing to say (ironically this is one of the few courses that is still useful to me today). So I didn’t like the computer science part so much, but I really enjoyed the statistics part.

I was studying at the University of Auckland, which is the home of the R programming language; pretty much everyone was using it, and as a student I had to use it to do statistics. I found it very intriguing and interesting, and a skill that combined interests of mine. From there I went on to do an M.Sc. in statistics in Auckland, and a PhD in the United States.

How did you then decide to start writing your own R packages?

I was really lucky as a PhD student in the US; usually you’re assigned a position as a teaching assistant for a course in your department, but instead I got a consulting position where I helped students from other departments to do statistical analysis. This gave me a steady stream of people with statistics problems. It made me realize that fitting models was the easy part; what was hard was getting data organised in a way that made sense, instead of constantly fighting to get it in the right form, and then visualising it to understand what was going on. It became obvious that my repeated efforts to reshape and visualise data could be wrapped up in useful packages.

Your most famous contribution to data science, beyond the world of R, is what’s known as ‘tidy data,’ a concept you theorised in a paper published in 2014. Can you explain what tidy data is, and why it’s so important?

I realised recently that the paper was only published four years ago. It feels like it’s been much longer than that! The idea of tidy data is to get people to store data in a consistent way, so that all of their tools can work with it efficiently, without having to wrangle and reshape it every time. The basic concept is very simple: when you’re working with data, make sure that each column is a variable and each row is an observation. When you store data like that, your life gets much easier.
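The target shape is easy to show concretely. As a minimal sketch in Python with pandas (the countries and case counts here are invented for illustration), an untidy table that stores values of the “year” variable in its column headers can be reshaped so each column is a variable and each row is an observation:

```python
import pandas as pd

# Untidy: the values of the "year" variable live in the column headers,
# so each row mixes several observations together.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# Tidy: each column is a variable (country, year, cases),
# and each row is exactly one observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```

In the tidyverse the equivalent reshape is `tidyr::pivot_longer()`; the point is the target shape, not the particular tool.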


It’s actually not really a new idea. It’s a rephrasing of the second and third normal forms in relational databases, which were my original programming background. But those two normal forms sound incredibly complex; they sort of make sense if you’ve spent years working on databases, but most people simply won’t understand them. So in a way, ‘tidy data’ consists of stating those normal forms in a way that makes sense for statisticians and data scientists.

Before we dive in and talk about the books, what is your view on programming languages when learning data science? Obviously your biggest contributions have been in R, but you’ve also worked on projects that try to bridge R and Python, and a lot of code you write behind the scenes uses C++. If data scientists want to build solid computer science fundamentals for themselves, do you think that they should learn another general-purpose programming language beyond R?

That’s an interesting question. C++ is very important for me as a tool, but the goal of my work is to write C++ code so that data scientists don’t have to. As a programmer, I think it’s intellectually satisfying to learn about programming languages, and see how other languages think about instructions and data. But many data science courses nowadays will try to teach Bash, SQL, Python and R all in the same course, which I think is a bad idea; you’d never take a language course where you’d try to learn French, German, Italian, and Japanese at the same time.

“As a programmer, it’s intellectually satisfying to learn about programming languages, and how they think about instructions and data”

Pragmatically, if you’re a data scientist, learning the basics of SQL is really important. You should also have a minimal reading knowledge of R and Python, because so many data science teams use both. Then I think you’re better off specializing in one of these two and getting really good at it, rather than spreading yourself too thin and being mediocre at several languages. Obviously I think that R is an awesome language, but not because I have anything against Python… I just think that R is really great!

1. Structure and Interpretation of Computer Programs by Gerald Jay Sussman, Harold Abelson & Julie Sussman

Let’s start talking about your book choices. The first one is Structure and Interpretation of Computer Programs. It’s a famous textbook that initially supported the introductory computer science course of the same name at MIT. From there it quickly became a reference for computer science learners everywhere. Why is it such an important book?

One interesting anecdote is that MIT no longer uses this book to teach its introduction to computer science. They’ve switched to Python instead of Scheme, which is the language taught in this book. The reasoning behind this is that the world doesn’t need more computer scientists; it needs some, but, by and large, what it needs is engineers who know how to use programming languages and achieve a goal, rather than thinking about the atomic constituents of computer science.

But this book is very useful for somebody like me, with experience in high-level engineering languages, like VBA, PHP and R. They’re incredibly useful languages, but ones that computer scientists generally disdain, because they’re not theoretically pure or beautiful. This book shows you how languages can be constructed. The most valuable thing it gives you is confidence and knowledge to go and create your own programming language. You get a very good understanding of some of the trade-offs that you have to make when designing languages. For example R does a lot of things that are very unusual among programming languages, and some of them could be considered mistakes, but a lot of them exist because R is trying to achieve a particular objective, and was thus designed following specific and sensible constraints.

Another similar and also interesting book is Concepts, Techniques, and Models of Computer Programming, which explains all the models of computer languages and how they fit together. But it’s even more complex than Structure and Interpretation of Computer Programs, so I’d stick with that choice for somebody getting started.

As you said, it teaches those concepts by using Scheme, which is a dialect of the Lisp language. Do you think that data scientists should make the effort to understand its basics, despite the language barrier? Scheme is a language with a very specific syntax, and it’s no longer taught to undergraduate students in most universities.

I would not describe Scheme as a useful language. It comes back to the question of why you should use one programming language over another. You should not make that decision based on the technical merits of each language, but instead based on the community of people who use it and are trying to solve problems like yours. The community of people using Scheme today is small, and somewhat esoteric, but there are interesting ideas to be learned in the language anyway. And it was very influential in the design of R. R itself is a hybrid of S, a language designed in the 1970s from a pure statistics standpoint, and Scheme; so I learned it to satisfy my curiosity about why the creators of R thought that Scheme was so great. Finally, Scheme is a functional programming language, rather than an object-oriented one; and functional programming is currently experiencing a resurgence of interest.

2. The Algorithm Design Manual by Steven S. Skiena

The second book that you chose is The Algorithm Design Manual. Algorithms are a big part of computer science knowledge. Can you explain why you think it’s still important to learn about them, when most of them have been implemented for data scientists by people like you?

To me, this book is an illustration of the power of names. Today, in the era of Google, if you know the name of something, you can find out about it with a simple search. But if you don’t know the name of what you’re looking for, it suddenly becomes much harder to find it. Having in the back of your head the names of common algorithms that help you solve problems is really powerful. When you identify a new problem, it helps you to come up with ideas, for example to use breadth-first search, or a binary tree, etc.

The book also covers the important topics of computational complexity and Big O notation. Similarly, is it important for data scientists to study those topics, not necessarily because they’ll need to use them often, but in order to acquire the intuition that something requiring n*log(n) computations is preferable to something requiring n² ones?

Yes, it’s good to acquire a sense of that. A lot of statistical theory is about measuring what happens to mathematical properties as some variable n goes to infinity, without thinking about what then happens to computational properties. But if your algorithm needs n² computations, it doesn’t matter what the theory says as n goes to infinity, because you’ll never be able to compute that.
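The gap he describes is easy to make concrete. A quick sketch of how the two operation counts diverge as n grows:

```python
import math

# Compare the two growth rates from the question: n*log2(n) versus n^2.
for n in [1_000, 1_000_000]:
    nlogn = n * math.log2(n)
    nsq = n ** 2
    print(f"n = {n:>9,}:  n*log2(n) ≈ {nlogn:,.0f}   n^2 = {nsq:,}   ratio ≈ {nsq / nlogn:,.0f}")
```

At a thousand items the n² algorithm does about a hundred times more work; at a million items, about fifty thousand times more, which is the difference between a query and an overnight job.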

3. The Pragmatic Programmer: From Journeyman to Master by Andrew Hunt & David Thomas

Your third choice is The Pragmatic Programmer: From Journeyman to Master. This is another famous one, often used in universities. What is it about?

This is about the craft of software development, and thinking about how to produce good code. As the name suggests, it’s a very pragmatic and hands-on book. It really helped me on my journey as a software engineer, to be able to write quality code day in and day out, and be confident that it’s going to work correctly. It’s something that we never really talked about in my computer science education, and it’s certainly something that statisticians rarely think about. The goal is to turn an idea in your head into code that works, and that you can share with others.

How would you define ‘good code’ then?

I think there are three main parts. First, for code to be good, it has to be correct and do what you think it does. Ideally, you want to verify that correctness somewhat formally, by writing unit tests. The idea of unit tests is the same as double-entry bookkeeping: if you record everything in two places, the chances of you making a mistake in both places on the same item are very low. So unit tests don’t guarantee that your code is correct, but they make it much more likely to be correct.
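A minimal sketch of that double-entry idea in Python (the `rescale01` helper and its cases are invented for illustration): the test records the expected answers independently of the implementation, so a bug has to appear in both places, in the same way, to slip through.

```python
def rescale01(xs):
    """Linearly rescale a list of numbers so they run from 0 to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# The test is the second ledger: expected answers worked out by hand,
# written down separately from the code that computes them.
def test_rescale01():
    assert rescale01([2, 4, 6]) == [0.0, 0.5, 1.0]
    assert rescale01([-1, 0, 1]) == [0.0, 0.5, 1.0]
    assert rescale01([10, 20]) == [0.0, 1.0]

test_rescale01()  # raises AssertionError if either entry disagrees
```

In R the same habit is usually expressed with the testthat package; the bookkeeping logic is identical.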

But the requirements of your code will also change over time. So the second part, and maybe the bigger one, is to write code that will be correct in the future, i.e. easy to maintain. For example, it’s very important to write code that clearly communicates its intent; because you will come back to it six months later, having completely forgotten what you were trying to do. So the easier it is to read your code and understand what’s going on, the easier it will be to add new features in the future.

“When writing code, you’re always collaborating with future-you; and past-you doesn’t respond to emails”

The third part would be to make sure that it’s fast enough, so that it doesn’t become a bottleneck. It can be easy (and fun!) to get carried away with this, and obsess over writing code that’s ever faster. The important thing is to make sure that nothing is overly slowing down execution, to the point of interrupting the flow of your analysis, or meaning that your program has to run overnight. But it doesn’t matter how fast your code is, if it’s not correct and maintainable.

4. The Art of Readable Code by Dustin Boswell & Trevor Foucher

On a similar subject, your fourth choice is The Art of Readable Code: Simple and Practical Techniques for Writing Better Code. The importance of writing readable code is often stressed when learning programming, especially when collaborating with others; but it’s also an aspect that’s easy to neglect on a day-to-day basis. How does this book help with that?

The hard part of writing readable code isn’t identifying the problems; you can easily tell whether your code is understandable or not. The challenge is knowing how to make it better. The software development community often uses this idea that code ‘smells,’ to say that it’s badly written. What I liked about this book is that it gives you a series of techniques to make that smell go away.

5. Style: Lessons in Clarity and Grace by Joseph Bizup & Joseph M. Williams

This relates strongly to your fifth choice, probably a less expected one – Style: Lessons in Clarity and Grace. This is about writing as well, but not necessarily writing code. Why did you choose it?

Similarly, it’s easy to look at a sentence or a paragraph and to say that it doesn’t make sense or is badly written. This book gave me the tools to analyze a text and identify the reasons why it doesn’t work, for example stating the topic of a paragraph only in the middle of it. I found it very useful to consciously analyze my own writing. That’s important with programming, because obviously you’re communicating with a computer, but more importantly you’re communicating with other humans. And humans are much harder to work with, because you can’t write unit tests to check their understanding. And even if you’re the only programmer working on a particular project, you’re actually always collaborating with future-you; and past-you doesn’t respond to emails.

“Writing well and describing things well is very valuable to a good programmer, and even more to a data scientist”

Knowing how to write clearly helps you to write code clearly, and also helps you write good documentation and explain the intent of what you’re doing. Even very good code will only ever tell you how something has been implemented; it won’t tell you why a particular technique has been chosen. Writing well and describing things well is very valuable to a good programmer, and even more to a data scientist. It doesn’t matter how wonderful your data analysis is, if you can’t explain to somebody else what you’ve done, why it makes sense, and what to take away from it.

You make this distinction between writing for computers and writing for humans, but one of the characteristics of your work has been to use elements of style and clarity to enhance the R language. You often talk about the importance of semantics and grammar in code, for example in ggplot2, your data visualization package that’s based on the grammar of graphics. It’s also visible in the way that the tidyverse has completely changed the way data scientists write code in R, including the iconic ‘pipe’. What made you place so much importance on semantics and grammar in programming?

Partly because of another book that nearly made it onto my list: Domain-Specific Languages by Martin Fowler. It talks about the idea of writing a small language inside another language, to express ideas in a specific domain, and the idea of ‘fluent’ interfaces, that you can read and write as if they were human language. There have actually been attempts, for example by Apple, to write programming languages that were exactly like human language, which I think is a mistake because human language is terribly inefficient, and relies on things like tone and body language to clarify ambiguity. But thinking about how you can make a computer language as similar as possible to a human language is important. It can take simple forms, like thinking of functions as verbs, and objects as nouns, so you can draw on the grammatical intuition that comes from human language.
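The verbs-and-nouns idea is what makes a pipeline read like a sentence. As a minimal fluent-interface sketch in Python (the `Query` class, its methods, and the data are all invented for illustration), each verb returns a new object, so calls chain left to right like clauses:

```python
class Query:
    """A tiny fluent interface: each verb returns a new Query,
    so calls chain left-to-right like clauses in a sentence."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):        # verb: keep matching rows
        return Query(r for r in self.rows if predicate(r))

    def select(self, *keys):            # verb: keep named columns
        return Query({k: r[k] for k in keys} for r in self.rows)

    def arrange(self, key):             # verb: sort by a column
        return Query(sorted(self.rows, key=lambda r: r[key]))

people = [
    {"name": "ana", "age": 34},
    {"name": "bo", "age": 19},
    {"name": "cy", "age": 27},
]

# Reads almost like prose: take people, keep adults, sort by age, show names.
adults = Query(people).filter(lambda r: r["age"] >= 21).arrange("age").select("name")
print(adults.rows)
```

This is the same grammatical intuition the tidyverse leans on, where `filter()`, `select()`, and `arrange()` are dplyr verbs joined by the pipe instead of method chaining.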

Another thing I’ve been exploring lately is the question of foreign languages. The tidyverse gives you access to all of these verbs, but they’re all in English. Should we have translations of the tidyverse? Could we have a Spanish tidyverse, with Spanish equivalents of the verbs? Of course it raises many problems, the biggest one being that 75% of the resources available on sites like StackOverflow are in English, so the answers wouldn’t be universal anymore. But that’s an interesting area where we’re running small experiments; there’s a group of Spanish speakers working on a translation of the R for Data Science book, which includes translating some of the datasets that are used in it. I’m very interested to see where that goes, and how useful it can be to aspiring data scientists everywhere, especially when R is quickly democratizing access to the subject, well beyond the academic world.