Over the past several weeks I have been helping students, career professionals, and people of other backgrounds learn R. During this time one this has become apparent, people are teaching the old paradigm of R and avoiding the tidyverse all together.

I recently had a student reach out to me in need of help with the first programming assignment from the Coursera R-Programming course (part of the Johns Hopkins Data Science Specialization). This particular student was struggling with combining the her new knowledge of R data types, conditional statements, looping, control statements, scoping, and functions to solve the assignment problem set. I provided her with a walk through of each question in base R, the style of the course. I couldn’t help but empathize with her as I too learned the long way first. However I thought that she shouldn’t be learning the hard way first (see David Robinson’s blog post, “Don’t teach students the hard way first”), she should be learning the effective way.

In my written response to her, I gave her solutions to her problems in base R and using the tidyverse. Here, I will go over the problems and adress them from a tidy perspective. This will not serve as a full introduction to the tidyverse. For an introduction and a reason why the tidyverse is superior to base R, I leave you with Stat 545: Introduction to dplyr

The assignment utilizes a directory of data called specdata which can be downloaded here, and describes it:

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

Date : the date of the observation in YYYY-MM-DD format (year-month-day)

: the date of the observation in format (year-month-day) sulfate : the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)

: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter) nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.

Part I Problem 1: Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows pollutantmean <- function(directory, pollutant, id = 1:332){ ## 'directory' is a character vector of length 1 indicating ## the location of the CSV files ## 'pollutant' is a character vector of length 1 indicating ## the name of the pollutant for which we will calculate the ## mean; either "sulfate" or "nitrate" ## 'id' is an integer vector indicating the monitor ID numbers ## to be used ## Return the mean of the pollutant across all monitors list ## in the 'id' vector (ignoring NA values) ## NOTE: Do not round the result! } Before we tackle the function, I believe the best approach is to first solve the problem in a regular script. This problem has four clear steps: Identify files in the directory Subset files based on provided ID Read the files Calculate and return the mean on the desired column This problem gives us a directory of files from which we need to read in the data based on the provided IDs. For the sake of this walk through we will randomly sample 10 values within the range designated in the problem statement (332). We will first generate random IDs, then identify all of the files within the specified directory and obtain their file paths using the list.files() function. After this we will subset our file list based on the IDs, then iterate over our file list and read in each file as a csv using purrr:map_df() combined with readr::read_csv() . Fortunately map_df() returns a nice and pretty data frame which lets us avoid having to explicitly bind each unique data frame. Identify Files Here we create 10 random IDs and store them in the ids variable. Next we use list.files() to look within the specdata folder that we downloaded above. Everyone’s path will most likely be different. Be sure to obtain the correct file path—help for Mac. Next we identify the files we need based on the sampled ids and store the subset in the files_filtered variable. We use the values of the ids to locate the file paths positionally. For example, ID number 1 is the first file, number 10 is the tenth, etc. # Load our handy dandy functions library(tidyverse) # 10 random IDs in ID range ids <- sample(1:332, 10) # Identify all files within the directory files <- list.files("../../data/specdata", full.names = TRUE) # Subset the data files_filtered <- files[ids] # View the files to verify paste(ids, files_filtered) Reading the Files Now that we have identified the files that we are going to read in, we can use purrr:map_df() to apply the readr::read_csv() function to each value of files_filtered and return a data frame (hence the _df() suffix). We supply additional arguments to read_csv() to ensure that every column is read in properly. # Read in the subset of the data. Notice the brackets after files[] specdata <- map_df(files_filtered, read_csv, col_types = list( col_date(), col_double(), col_double(), col_integer() )) glimpse(specdata) Next, we get to utilize some dplyr magic. Here we take the specdata object we created from reading in our files, deselct the Date column, then utilize summarise_if() to apply the mean() function to our data. summarise_if() requires that we provide a logical statement as the first argument. If (hence the _if() suffix) the logical statement evaluates to TRUE on a column then it will apply a list of functions to those columns where the statement evaluated to TRUE . We can also specify additional arguments to the functions. Here we specify na.rm = TRUE for handling missing values. In this case, we are checking to see if our columns are of the data type double using the is.double() function. If you’re wondering why we didn’t use is.numeric() , it’s because the ID column is an integer which is considered numeric. If we wanted to take the underlying vector of one of the columns, we can also, use dplyr::pull(col_name) . This will be helpful later when we want to obtain the mean of just one column. # Obtain mean nitrate specdata %>% select(-Date) %>% summarise_if(is.double, mean, na.rm = TRUE) %>% # Pull just the sulfate column pull(sulfate) specdata %>% select(-Date) %>% summarise_if(is.double, mean, na.rm = TRUE) %>% # Pull just the nitrate column pull(nitrate) Now that we have all of the tools, we can put this together into a single function, which I will call pollutant_mean() to somewhat adhere—functions should take the name of verbs—to the tidyverse style guide. Here we have three arguments: directory : Where to look for the files

: Where to look for the files pollutant : Which pollutant (nitrate or sulfate) to evaluate This needs to be a character value unless we want to get into tidyeval, which frankly I will leave to the professionals. But I will provide an alternative solution at the end that doesn’t require quoted pollutant names.

: Which pollutant (nitrate or sulfate) to evaluate id : Which monitoring stations we should look at Within the function we take everything we did in the above steps but generalize it to a function. We identify the files in the directory provided ( specdata ), subset the files positionally based on the provided id vector, and then iterate over the file names and read them in with map_df() and read_csv() . Next we take our data and calculate the mean on both sulfate and nitrate columns. We then pull() the specified column from the pollutant argument and then return() that value. pollutant_mean <- function(directory, pollutant, id = 1:332) { files <- list.files(directory, full.names = TRUE) files_filtered <- files[id] specdata <- map_df(files_filtered, read_csv, col_types = list( col_date(), col_double(), col_double(), col_integer() )) specdata %>% select(-Date) %>% summarise_if(is.double, mean, na.rm = TRUE) %>% pull(pollutant) %>% return() } Here we can test out the function with both types of pollutants and different id values. pollutant_mean(directory = "../../data/specdata", pollutant = "sulfate", id = sample(1:332, 20)) #pollutant_mean("../../data/specdata", "nitrate", 2)

Part II: Let us continue to the second problem in the problem set. Problem 2: Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. The assignment provides an example function format, but I think it to be a bit misleading. So I will go about this in the way I think is best. We will work on creating a function called complete_spec_cases() which will take only two arguments, directory , and id . directory and id will be used in the the same way as the previous problem. For this problem our goal is to identify how many complete cases there are by provided ID. This should be exceptionally simple. We will have to identify our files, subset them, and read them in the same way as before. Next we can identify complete cases by piping our specdata object to na.omit() which will remove any row with a missing value. Next, we have to group by the ID column and pipe our grouped data frame to count() which will count how many observations there are by group. We will then return this data frame to the user. complete_spec_cases <- function(directory, id = 1:332) { files <- list.files(directory, full.names = TRUE) specdata <- map_df(files[id], read_csv, col_types = list( col_date(), col_double(), col_double(), col_integer() )) complete_specdata <- specdata %>% na.omit() %>% group_by(ID) %>% summarise(nobs = n()) return(complete_specdata) } complete_spec_cases("../../data/specdata", id = sample(1:332, 20))