Dataviz and the 20th Anniversary of R, an Interview With Hadley Wickham

Catching up with the creator of ggplot and tidyr about community, Tidyverse, and what’s in store for R over the next 20 years

On February 29th, 2000, R 1.0.0 was released to the world. At the time, it was a niche language for statisticians; now, over 20 years later, it has grown into a fully-featured programming language that is used by millions of people and has had incalculable influence in the world of data science and visualization.

R, along with Python, is one of the primary languages used today for data science and visualization. It has its roots in academia and is still widely used by researchers across nearly all disciplines, but the reach of R has expanded greatly in recent years. R has now been adopted by journalists around the world as a go-to choice for data wrangling and visualization, and by various industries, where it powers AI models and data science products at some of the world's largest companies.

Among the R community, Hadley Wickham is a celebrity. He is the creator of the seminal charting library ggplot, data wrangling libraries such as dplyr and tidyr, and a frequent target of extremely niche R memes. He currently works as Chief Scientist at RStudio, where he leads a team of developers working on the Tidyverse, a suite of R packages that facilitate data science.

Hadley got his start programming Microsoft Access databases, and was first introduced to R as a student at the University of Auckland (where R was created). He moved to Iowa State for his PhD and there developed ggplot, a charting library based on the grammar of graphics. He went on to write dozens of widely used R packages for data wrangling, analysis, and visualization.

Hadley is well known for his concept of “tidy data,” a principle that was inspired by his experience designing databases. He is also a huge advocate for making R welcoming, open, and accessible, both as a language and as a community, and he is the author of four books (three of which are open source). Hadley was the recipient of the 2006 John Chambers Award for Statistical Computing, and in 2019 he received the prestigious COPSS Presidents’ Award for his “influential work in statistical computing, visualization, graphics, and data analysis,” including “making statistical thinking and computing accessible to a large audience.”

In 2018, I decided to start blogging about R. For my first post, I thought “Wouldn’t it be funny (and gain me some Twitter followers) if I analyzed Hadley Wickham’s Twitter feed to see if he’s a cat or dog person?” I emailed Hadley to ask for his permission, telling him “I don’t want to be creepy,” and then, with his blessing, I went and wrote a slightly creepy blog post about him, which you can still read (forgive my charts, I was young and foolish). The post had its intended effect, and earlier this year I achieved even greater virality with a spoof I made of Hadley on the cover of the ‘Glamour of Graphics’ magazine (which I’m proud to say is now displayed on his parents’ refrigerator).

Since this approach had worked twice, I figured why not ride Hadley’s coattails a bit further, so I reached out to see if he would give an interview for the 20th anniversary of R. In that first email I sent back in 2018, Hadley responded with a single word: “Sure!” True to form, he gave the same response to my request for an interview. I spoke with Hadley — who generously gave his time for this interview — about the impact of R over the last 20 years, and the past, present, and future of dataviz in R. This transcription has been edited for clarity.

Will Chase: You were first introduced to R during your undergrad at the University of Auckland in 2003. What was it like to program in R in those days?

Hadley Wickham: I never actually had a class that was specifically on R or R programming, but just about every statistics course at Auckland used it, and I had done quite a bit of other programming as a double major in computer science. I remember R felt like such a weird language that did things I did not think were possible in a programming language. I primarily used R with the R GUI on Windows, and all of my scripts from that time are .txt files, so I don’t know exactly how I was editing them; I even have some code in Word documents from that time. So, it was a totally different experience in many ways. This was before RStudio, and the only place to get help was R-help, which was pretty intimidating. I used mostly base graphics, base everything.

WC: Did you have any contact at that time with Ross Ihaka or Robert Gentleman (the creators of R)?

HW: No, none at all actually. I’ve had way more to do with them since I left Auckland than I ever did when I was a student there.

WC: Can you talk about the inception of ggplot? What was the motivation for the package and what was the early development like?

HW: I had been doing graphics in lattice, and there were a few things about lattice that really annoyed me or weirded me out. For example, you could take the function for a scatter plot and, by adding an argument, turn it into a box-and-whisker plot. That just seemed ill-founded, on principle. Around that time I started reading The Grammar of Graphics by Leland Wilkinson, and I was like “Wow, this is describing exactly the properties of a plotting system that I would like to have. How do I use this?” And there was no way to use it. I think there was one commercial software package that cost tens of thousands of dollars, so I thought, “Well, why don’t I try doing this in R?”

Then fairly early on in the development of ggplot, Wilkinson visited Iowa State, where I was doing my PhD, and he was really supportive and gave me lots of helpful feedback — he’s been very supportive throughout the years as well. I think in many ways ggplot is the truest implementation of the grammar of graphics, and it’s inspired many other programs and packages over the years.

So that was ggplot. At the time I was really into functional programming, so the original ggplot was a “functional grammar of graphics”: it was all about composing functions. That’s actually a really good idea, but you end up with this awful syntax where you have to nest all your function calls.

WC: Yeah, I read an example you gave one time of the original ggplot syntax, it was… interesting.

HW: Yeah, so I think I correctly identified that problem. And if I had discovered the pipe at the time, it probably would have stayed like that, but I think I was reading about operator overloading and I thought “Oh maybe I could do this with ‘+’ instead”, and it kind of makes sense, you know, because you’re adding layers to the plot. I realized this was a fundamental change in the package and I didn’t want to break the code of the tens (or hundreds) of people using it at that time, so I decided to make a new package and give it a new name, and that’s how ggplot2 came to be.
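The additive syntax Hadley describes is, of course, how ggplot2 still works today. As a minimal illustration (using the mpg dataset that ships with ggplot2), each “+” adds a layer rather than nesting one function call inside another:

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +  # map engine size to x, highway mpg to y
  geom_point() +                        # layer 1: points
  geom_smooth()                         # layer 2: a smoothed trend line
```

Because each layer is just a value being added, a plot can also be stored in a variable and built up incrementally.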

WC: You’ve talked before about how R is a programming language that’s primarily used by non-programmers, and I’d say it’s probably the most widely distributed way to do modern dataviz, in terms of being used across academia, journalism, and industry. Do you think that widespread distribution is a consequence of the original design of R and ggplot, or is it the other way around, that you’ve considered this fact and designed APIs around it?

HW: I think it’s a bit of both. John Chambers’s original vision for S was a language for statisticians — who are non-programmers — and that vision underpins R in a lot of subtle ways. So I think there’s definitely this idea in R that it’s a tool for non-programmers, and you should be able to do things without learning complex computer science concepts. At the same time, I started teaching R and ggplot pretty early on. I don’t really know how this happened, but I ended up teaching one of the graduate intros to statistical computing when I was a PhD student at Iowa State, with basically no supervision.

And I still remember, I was given this curriculum to teach, and the only thing I remember about it is that it came with a box of rocks, because one of the exercises was recording the weights of the rocks and doing something with them… and I was just like “This is ridiculous, I’m not going to teach this.” So I came up with a completely new curriculum. But I think teaching statistical computing to graduate students and to undergraduates from other disciplines was really helpful to me, because it forced me to confront all these things that I kind of knew, but that other people clearly didn’t, and that were significant barriers to people using the tools. So that pushed me to think about the user interface: how to make something easier for people to learn, how to teach it, and how to document it along the way.

WC: In the same vein, you’ve talked about the importance of human-centered design in programming languages, and that’s clearly evident in the Tidyverse. So I’m curious if you have any advice for other tool developers on how to incorporate those ideas effectively.

HW: I think part of the problem is that the idea that you can apply design principles to programming languages is something people have only realized quite recently, so there are relatively few resources around. I think the best you can do is general reading about design. The Design of Everyday Things is a classic and was really influential when I first read it. Then it’s about building up empathy for your users: actually watching people do stuff and experience it, and reflecting on what makes it hard or easy. After that it’s just iteration.

WC: So when you’re designing APIs for your packages do you do specific user testing, or is it more informal kind of crowdsourced feedback?

HW: Yeah, it’s more crowdsourced. I don’t feel like user testing is that useful for me anymore. I don’t know if that’s because I don’t need it, or because I just refuse to accept the findings when I do it. But we do a lot of informal user testing internally, just sketching stuff out. One thing I do a lot of is sketching. Sometimes that’s just pen and paper, writing out how the code would look, and sometimes that’s typing up imaginary code to see how it feels and imagine how it would work. And now that I have a team at RStudio, we do a lot of that internally to get opinions, and then it gets thrown out to the wider world through Twitter and blog posts and such.

WC: You’ve slowed the development of new features in ggplot and shifted to an extension mechanism. What was your experience with cultivating this ecosystem of extension packages and developers? Do you think this is a model that can be applied to other languages or tools?

HW: There was basically no way to extend ggplot2 for the vast majority of its life. It took 5 or 6 years to get a proper extension system, and I think that was the right decision, because any extension mechanism has to be inextricably woven into the internals, and once you do that it makes it much more difficult to change anything on the inside because it breaks other people’s code. And extension doesn’t really pay off until you’ve got a large enough audience of people using it that some fraction of people will actually go off and extend it.

I think it was absolutely the right thing to do, and I think it could have been done a bit earlier in ggplot. Thomas [Lin Pedersen] pushed most of that development when he was an intern and he’s continued building that out since. And I like this model because I think one of the reasons the Tidyverse has flourished is that it’s built on this slow-moving but stable foundation of base R, and now you get the same thing with ggplot2 where it’s pretty big and slow-moving, but that’s OK because on top of that grows this much faster, nimbler extension ecosystem. But in terms of applying this to other tools, until you have that big user base I think it’s easy to spend a lot of time making something extensible and then not have anyone to extend it.

WC: For those of us who have come to the language in recent years, one of the hallmarks of R is its extraordinarily welcoming community, but I know that was not always the case. As someone who has lived through that transition in the community atmosphere, what do you think are the factors that contributed to what the community is today?

HW: I think the R community is kind of lucky because it’s connected to statistics, so it’s had this diversity in the domain sense, and then partly because of that, in the people sense as well. So I think for a long time there was this big pool of people that could potentially be contributing, but they were really put off by R-help (note: R-help is a notoriously hostile mailing list and was the only way to get help with R in the early days). And then the timing was lucky enough that there were two significant changes that allowed the community to reinvent itself to some degree.

The first of those was StackOverflow. It seems hard to imagine now, but at the time, StackOverflow was so incredibly welcoming and friendly. Part of that was that, in contrast to R-help, anything would seem welcoming and friendly; but partly, in the early days, StackOverflow really was better, and the people involved in getting it going in the R community were more explicit about making a welcoming environment. So there was that first reinvention when a lot of people switched to StackOverflow, and then there was a similar shift when #rstats started becoming really popular on Twitter; that was another opportunity for the community to reinvent itself in a positive way.

And some of that, I think, is a founder effect: some of the people involved early on were explicitly pro-diversity. And some of it, I think, is this correlation between being older and less tech-savvy, and being older and less pro-diversity, so some of the old crotchety people just get left behind on the previous platform when everyone else moves on. More recently, the R-Ladies community has been tremendous. Though I think it’s hard to tease out the causation — how much R-Ladies is a response to increasing diversity versus how much it is actually causing increased diversity — it’s clearly both, and it has been a nice positive feedback cycle.

WC: Are there any developments that you are working on now, or trends in the larger R community, that you think are really exciting?

HW: One of the things I’m excited about is the growth of really good R programmers. They’re not just good at doing data science in R; they’re actually really interested in programming in R and the craft of software engineering in R. And because these people are coming from this much bigger and broader R community, they tend to be much more diverse than the typical software engineer stereotype. That’s one of the things I’m really excited to see from the R-Ladies community over the next few years: historically, the majority of the R-Ladies community were new R users, and now we’re starting to see more experienced developers and people becoming expert programmers and software developers. So I’m excited that people are generally getting better at programming in R, and that the people getting better do not resemble the classic software engineer stereotype.

WC: What do the next 20 years look like for dataviz in R?

HW: (laughs) I don’t know, this is one thing that Thomas [Lin Pedersen] and I have fundamentally conflicting views about (note: Thomas Lin Pedersen is currently the lead developer of ggplot at RStudio). I think the future of visualization in R is fundamentally and inextricably tied to visualization in JavaScript. It’s really about how we connect all of the really beautiful visualizations and interactivity that JavaScript allows to R, strongly and seamlessly, so you can get the best of both worlds. Thomas, however, hates JavaScript. I’m pretty sure he’s wrong, but it does mean he’s less likely to work on the next iteration of R+JavaScript. I do get where he’s coming from, though, because one of the big problems with JavaScript visualization is how you turn it into a publishable artifact, which is still what so many people want to do. It’s easier than ever to publish a Shiny app that’s fundamentally interactive, but that’s still nowhere near as easy as producing a static HTML document or PDF. So how do you harness the best of static visualization and the best of interactive visualization? I think that’s the fundamental challenge, and I don’t have a good sense of how to do it yet, though I hope in 20 years’ time I will have figured it out (laughs).

But I guess when I think about 20 years ago, visualization hasn’t changed that much. My general thesis is that the quality of the best visualization has maybe improved 10% in the last 150 years: the best visualization you can make today is only slightly better than the best visualization someone could make 150 years ago. But the time it takes to make them has probably decreased by three orders of magnitude. So I don’t see the best visualizations getting much better, but I do see the number of people making high-quality visualizations growing, and the ease with which you can do that growing. I hope that more people continue to express visualizations in a programming language, rather than a point-and-click tool. That’s my hope for the next 20 years: that it becomes a given that code is the language of data science and data visualization.