Hadley Wickham











Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he received his Ph.D. in Statistics from Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing to data modeling to visualization.





Which term applies to you: data scientist, statistician, computer

scientist, or something else?





I’m an assistant professor of statistics, so I at least partly

associate with statistics :). But the idea of data science really

resonates with me: I like the combination of tools from statistics and

computer science, data analysis and hacking, with the core goal of

developing a better understanding of data. Sometimes it seems like not

much statistics research is actually about gaining insight into data.



You have created/maintain several widely used R packages. Can you

describe the unique challenges to writing and maintaining packages

above and beyond developing the methods themselves?





I think there are two main challenges: turning ideas into code, and

documentation and community building.



Compared to other languages, the software development infrastructure

in R is weak, which sometimes makes it harder than necessary to turn

my ideas into code. Additionally, I get less and less time to do

software development, so I can’t afford to waste time recreating old

bugs, or releasing packages that don’t work. Recently, I’ve been

investing time in helping build better dev infrastructure: better

tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and building package websites [staticdocs]. Generally, I’ve

found unit tests to be a worthwhile investment: they ensure you never

accidentally recreate an old bug, and give you more confidence when

radically changing the implementation of a function.
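
For readers who haven’t used testthat, here is a minimal sketch of the kind of regression test described above; the helper function and the expected values are made up for illustration:

```r
# Capture a fixed bug as a test so it can't silently reappear.
# col_means() is a hypothetical helper under test.
library(testthat)

col_means <- function(df) vapply(df, mean, numeric(1), na.rm = TRUE)

test_that("col_means ignores missing values", {
  df <- data.frame(x = c(1, 2, NA), y = c(4, 5, 6))
  expect_equal(col_means(df), c(x = 1.5, y = 5))
})
```

A suite like this can be re-run as part of `R CMD check`, so an old bug that resurfaces fails loudly instead of slipping into a release.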



Documenting code is hard work, and it’s certainly something I haven’t

mastered. But documentation is absolutely crucial if you want people

to use your work. I find the main challenge is putting yourself in the

mind of the new user: what do they need to know to use the package

effectively? This is really hard to do as a package author because

you’ve internalised both the motivating problem and many of the common

solutions.



Connected to documentation is building up a community around your

work. This is important to get feedback on your package, and can be

helpful for reducing the support burden. One of the things I’m most

proud of about ggplot2 is something that I’m barely responsible for:

the ggplot2 mailing list. There are now ggplot2 experts who answer far

more questions on the list than I do. I’ve also found GitHub to be

great: there’s an increasing community of users proficient in both R

and git who produce pull requests that fix bugs and add new features.



The flip side of building a community is that as your work becomes

more popular you need to be more careful when releasing new versions.

The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN

packages, and forced me to rethink my release process. Now I advertise

releases a month in advance, and run `R CMD check` on all downstream

dependencies (`devtools::revdep_check` in the development version), so

I can pick up potential problems and give other maintainers time to

fix any issues.
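
For concreteness, here is a rough sketch of that reverse-dependency workflow with devtools (function names and arguments have shifted between devtools versions, so treat this as illustrative):

```r
library(devtools)

# List the CRAN packages that depend on, import, or suggest ggplot2
revdep("ggplot2")

# From within the ggplot2 source directory, run R CMD check on each
# reverse dependency against the new version and collect the results
revdep_check()
```

Problems that surface here can then be reported to the affected maintainers before the new version reaches CRAN.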



Do you feel that the academic culture has caught up with and supports

non-traditional academic contributions (e.g. R packages instead of

papers)?





It’s hard to tell. I think it’s getting better, but it’s still hard to

get recognition that software development is an intellectual activity

in the same way that developing a new mathematical theorem is. I try

to hedge my bets by publishing papers to accompany my major packages.

I’ve also found the peer-review process very useful for improving the

quality of my software. Reviewers from both the R journal and the

Journal of Statistical Software have provided excellent suggestions

for enhancements to my code.



You have given presentations at several start-up and tech companies.

Do the corporate users of your software have different interests than

the academic users?





By and large, no. Everyone, regardless of domain, is struggling to

understand ever larger datasets. Across both industry and academia,

practitioners are worried about reproducible research and thinking

about how to apply the principles of software engineering to data

analysis.



You gave one of my favorite presentations called Tidy Data/Tidy Tools

at the NYC Open Statistical Computing Meetup. What are the key

elements of tidy data that all applied statisticians should know?





Thanks! Basically, make sure you store your data in a consistent

format, and pick (or develop) tools that work with that data format.

The more time you spend munging data in the middle of an analysis, the

less time you have to discover interesting things in your data. I’ve

tried to develop a consistent philosophy of data that means when you

use my packages (particularly plyr and ggplot2), you can focus on the

data analysis, not on the details of the data format. The principles

of tidy data that I adhere to are that every column should be a

variable, every row an observation, and different types of data should

live in different data frames. (If you’re familiar with database

normalisation this should sound pretty familiar!). I expound these

principles in depth in my in-progress [paper on the

topic].
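
As a small illustration of those principles (the data here is invented), consider a table where one variable, year, is spread across the column headers; melting it with reshape2 produces one row per observation:

```r
library(reshape2)

# Untidy: the year variable is stored in the column names
untidy <- data.frame(
  country = c("AF", "BR"),
  `2010`  = c(101, 202),
  `2011`  = c(111, 212),
  check.names = FALSE
)

# Tidy: one column per variable, one row per observation
tidy <- melt(untidy, id.vars = "country",
             variable.name = "year", value.name = "cases")
tidy
#>   country year cases
#> 1      AF 2010   101
#> 2      BR 2010   202
#> 3      AF 2011   111
#> 4      BR 2011   212
```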



How do you decide what project to work on next? Is your work inspired

by a particular application or more general problems you are trying to

tackle?





Very broadly, I’m interested in the whole process of data analysis:

the process that takes raw data and converts it into understanding,

knowledge and insight. I’ve identified three families of tools

(manipulation, modelling and visualisation) that are used in every

data analysis, and I’m interested both in developing better individual

tools, but also smoothing the transition between them. In every good

data analysis, you must iterate multiple times between manipulation,

modelling and visualisation, and anything you can do to make that

iteration faster yields qualitative improvements to the final analysis

(that was one of the driving reasons I’ve been working on tidy data).



Another factor that motivates a lot of my work is teaching. I hate

having to teach a topic that’s just a collection of special cases,

with no underlying theme or theory. That drive led to [stringr] (for

string manipulation) and [lubridate] (with Garrett Grolemund, for working

with dates). I recently released the [httr] package, which aims to do a similar thing for HTTP requests - I think this is particularly important as more and more data lives on the web and must be accessed through an API.
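
As a taste of what that consistency looks like in practice (the URL below is just a placeholder, not a real endpoint):

```r
library(stringr)
library(lubridate)
library(httr)

str_detect(c("apple", "banana"), "an")   # strings: FALSE TRUE
ymd("2012-05-01") + months(3)            # dates: "2012-08-01"
resp <- GET("http://example.com/api")    # HTTP: fetch a resource
status_code(resp)                        # inspect the response
```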



What do you see as the biggest open challenges in data visualization

right now? Do you see interactive graphics becoming more commonplace?





I think one of the biggest challenges for data visualisation is just

communicating what we know about good graphics. The first article

decrying 3d bar charts was published in 1951! Many plots still use

rainbow scales or red-green colour contrasts, even though we’ve known

for decades that those are bad. How can we ensure that people

producing graphics know enough to do a good job, without making them

read hundreds of papers? It’s a really hard problem.



Another big challenge is balancing the tension between exploration and

presentation. For exploratory graphics, you want to spend five seconds

(or less) to create a plot that helps you understand the data, while you might spend

five hours on a plot that’s persuasive to an audience who

isn’t as intimately familiar with the data as you. To date, we have

great interactive graphics solutions at either end of the spectrum

(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one

end of the spectrum to the other. This summer I’ll be spending some

time thinking about what ggplot2 + [d3] might

equal, and how we can design something like an interactive grammar of

graphics that lets you explore data in R, while making it easy to

publish interactive presentation graphics on the web.