R tutorial August 25, 2014

Hands-on dplyr tutorial for faster data manipulation in R

I love dplyr. It's my "go-to" package in R for data exploration, data manipulation, and feature engineering. I use dplyr because it saves me time: its performance is blazing fast on data frames, but even more importantly, I can write dplyr code faster than base R code. Its syntax is intuitive and its functions are well-named, so dplyr code is easy to read even if you didn't write it.

dplyr is the "next iteration" of the plyr package (focusing on data frames, hence the "d"), and version 0.1 was released in January 2014. It's being developed by Hadley Wickham (author of plyr, ggplot, devtools, stringr, and many other R packages), so you know it's a well-written, well-documented package.

Teaching dplyr using an R Markdown document

As one of the instructors for General Assembly's 11-week Data Science course in Washington, DC, I had 30 minutes in class last week to talk about data manipulation in R, and chose to focus exclusively on dplyr. When putting together my presentation, I had a lot of great material to draw from.

I decided to create an R Markdown document to present from, since that would allow me to blend together R code, output, and explanatory text. You can view the rendered document on RPubs, or download the source document from GitHub. (Update: It has also been turned into a presentation, as well as translated into Indonesian!)

Using the hflights dataset (available on CRAN), I demonstrate the five basic dplyr "verbs," the chaining syntax, some of the more advanced functionality (such as window functions), a few of the new convenience functions that I find most useful (such as glimpse and summarise_each), and how to query a database using dplyr. I also compare many of the dplyr commands to the equivalent commands in base R. (Thanks to Hadley, because many of the examples I use are ones he wrote!)
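To give a flavor of what the five verbs and the chaining syntax look like, here is a minimal sketch. It is not taken from the tutorial document: the `flights` data frame below is a handful of made-up rows that only borrow a few column names from hflights, and the variable names (`base_result`, `dplyr_result`, `by_carrier`) are chosen purely for illustration.

```r
library(dplyr)

# Made-up rows that borrow a few hflights column names.
# The values are illustrative only, not real flight data.
flights <- data.frame(
  Month         = c(1, 1, 1, 2, 2, 2),
  UniqueCarrier = c("AA", "AA", "UA", "UA", "AA", "UA"),
  DepDelay      = c(5, 70, 90, -3, 65, 12),
  Distance      = c(224, 224, 1400, 1400, 224, 1400),
  AirTime       = c(40, 42, 180, 175, 41, 185),
  stringsAsFactors = FALSE
)

# Base R: subset rows and columns, then sort by delay (largest first).
base_result <- flights[flights$DepDelay > 60, c("UniqueCarrier", "DepDelay")]
base_result <- base_result[order(-base_result$DepDelay), ]

# dplyr: the same steps with filter, select, and arrange, chained with %>%.
dplyr_result <- flights %>%
  filter(DepDelay > 60) %>%
  select(UniqueCarrier, DepDelay) %>%
  arrange(desc(DepDelay))

# mutate adds a column; group_by plus summarise collapses each group to one row.
by_carrier <- flights %>%
  mutate(Speed = Distance / AirTime * 60) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg_delay = mean(DepDelay), avg_speed = mean(Speed))

print(dplyr_result)
print(by_carrier)
```

The chained version reads top to bottom in the order the operations happen, which is the main readability argument for the chaining syntax.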

Watch the dplyr tutorial on YouTube

After presenting, I recorded the entire presentation as a YouTube video (embedded below), since I know it can be helpful to hear someone explaining code that is unfamiliar to you. It runs 39 minutes, but if you only want to watch a particular section, simply click the topic below and it will skip to that point in the video.

- Introduction to dplyr (starts at 0:00)
- Loading dplyr and the example dataset (starts at 2:29)
- Understanding "local data frames" (starts at 3:23)
- Verb #1: filter (starts at 5:17)
- Verb #2: select, plus contains, starts_with, ends_with, matches (starts at 7:54)
- Using chaining syntax for more readable code (starts at 9:34)
- Verb #3: arrange (starts at 12:53)
- Verb #4: mutate (starts at 13:55)
- Verb #5: summarise, plus group_by, summarise_each, n, n_distinct, tally (starts at 15:31)
- Window functions: min_rank, top_n, lag (starts at 26:47)
- Convenience functions: sample_n, sample_frac, glimpse (starts at 32:44)
- Connecting to databases (starts at 34:21)
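A few of the functions named in the later topics (n, n_distinct, min_rank, lag) can be sketched on a toy example. As before, this is only an illustration under stated assumptions: the rows are made up, the column names are borrowed from hflights, and the result names (`counts`, `worst_per_carrier`, `monthly`) are hypothetical.

```r
library(dplyr)

# Made-up rows with hflights-style column names (illustrative values only).
flights <- data.frame(
  Month         = c(1, 1, 2, 2, 2, 2),
  UniqueCarrier = c("AA", "AA", "UA", "UA", "AA", "UA"),
  Dest          = c("DFW", "DFW", "ORD", "DEN", "MIA", "IAH"),
  DepDelay      = c(5, 70, 90, -3, 65, 12),
  stringsAsFactors = FALSE
)

# n() counts the rows in each group; n_distinct() counts unique values.
counts <- flights %>%
  group_by(UniqueCarrier) %>%
  summarise(flight_count = n(), dest_count = n_distinct(Dest))

# min_rank() is a window function: it returns one value per row,
# so it can be used inside filter() to keep the most delayed flight per carrier.
worst_per_carrier <- flights %>%
  group_by(UniqueCarrier) %>%
  filter(min_rank(desc(DepDelay)) == 1) %>%
  ungroup() %>%
  arrange(UniqueCarrier)

# lag() shifts a column down by one row, which gives a month-over-month change.
monthly <- flights %>%
  group_by(Month) %>%
  summarise(flight_count = n()) %>%
  mutate(change = flight_count - lag(flight_count))

print(counts)
print(worst_per_carrier)
print(monthly)
```

The first month has no previous row to compare against, so its `change` value is NA.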

What topics did I miss? What do you like most about dplyr? Let me know in the comments!

P.S. In March 2015, I released a follow-up tutorial covering the new features in dplyr versions 0.3 and 0.4!