dplyr: How to do data manipulation with R

OK. Here’s an ugly secret of the data world: lots of your work will be prep work.

Of course, any maker, artist, or craftsman has the same issue: chefs have their mise en place. Carpenters spend a heck of a lot more time measuring than cutting. Et cetera.

So be prepared: once you become a data scientist, most of your work (by common estimates, as much as 80%) will be data manipulation.

Getting data, aggregating data, subsetting data, cleaning it, and merging various datasets together: this constitutes a large percentage of the day-to-day work of an analyst.

When you’re just starting out with analytics and data science, you can get away with doing only minimal data manipulation. In the beginning, your datasets are likely to be txt, csv, or simple Excel files.

And if you do need to do some basic data formatting on simple datasets, you can do your data manipulation in Excel. (Actually, Excel is a good tool in your workflow for basic data manipulation tasks.)

As you progress, though, you’ll eventually reach a bottleneck. You’ll start doing more sophisticated data visualizations or machine learning techniques, and you will need to put your data in the right format. For that, you’ll need a better toolset.

By most accounts, the best toolset for data manipulation with R is dplyr.

dplyr: the essential data manipulation toolset

In data wrangling, what are the main tasks?

– Filtering rows (to create a subset)

– Selecting columns of data (i.e., selecting variables)

– Adding new variables

– Sorting

– Aggregating

dplyr gives you tools to do these tasks, and it does so in a way that streamlines the analytics workflow. It’s not an exaggeration to say that dplyr is almost perfectly suited to real analytics work, as it is actually performed.
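
As a rough sketch of how the five tasks listed above map onto dplyr functions, here is each one applied to a small data frame. The `sales` data and its column names are made up for illustration; only the dplyr functions themselves (filter, select, mutate, arrange, summarise, group_by) come from the package.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
sales <- data.frame(
  region = c("east", "west", "east", "north", "west"),
  units  = c(10, 4, 7, 12, 9),
  price  = c(2.5, 3.0, 2.5, 4.0, 3.0),
  stringsAsFactors = FALSE
)

# Filtering rows: keep orders with more than 5 units
big_orders <- filter(sales, units > 5)

# Selecting columns: keep only region and units
two_columns <- select(sales, region, units)

# Adding a new variable: revenue computed from existing columns
with_revenue <- mutate(sales, revenue = units * price)

# Sorting: largest orders first
sorted <- arrange(sales, desc(units))

# Aggregating: total units per region
by_region <- summarise(group_by(sales, region), total_units = sum(units))
```

Each call takes a data frame as its first argument and returns a new data frame, leaving `sales` itself unchanged.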

To be clear, these aren’t just the “basics.” They are the essentials. These are tasks that you’ll be doing every. single. day. You really need to master these.

Again though, dplyr makes them extremely easy. It’s the toolset that I wish I had years ago.

Moreover, once you combine dplyr verbs with “chaining” (covered below), the workflow becomes even more streamlined and more powerful. You can also chain the data wrangling tools of dplyr together with the data visualization tools of ggplot2. Once you start combining these, you will have a powerful toolset for rapid data exploration and analysis.

dplyr verbs

dplyr has five main “verbs” (think of “verbs” as commands).

filter()

Row selection from your data.

filter() subsets your data by keeping rows that meet specified conditions.
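
To make that concrete, here is a minimal sketch of filter() in use. The `vehicles` data frame and its columns are hypothetical, invented just for this example.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
vehicles <- data.frame(
  model = c("A", "B", "C", "D"),
  cyl   = c(4, 6, 4, 8),
  mpg   = c(31, 21, 24, 15),
  stringsAsFactors = FALSE
)

# Keep only the rows where cyl equals 4
four_cyl <- filter(vehicles, cyl == 4)

# Conditions separated by commas are combined with AND
efficient_four_cyl <- filter(vehicles, cyl == 4, mpg > 30)

# Use | inside a single condition for OR
either <- filter(vehicles, cyl == 8 | mpg > 30)
```

The first argument is the data frame. Every argument after it is a logical condition written in terms of the column names, and only rows where the conditions evaluate to TRUE are kept.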