Our world is generating more and more data, which people and businesses want to turn into something useful. This naturally attracts many data scientists (sometimes called data analysts, data miners, or other fancier names) who aim to help extract information from data.

A lot of data scientists around me graduated in statistics, mathematics, physics or biology. During their studies they focused on individual modelling techniques or nice visualizations for the papers they wrote. None of them had ever taken a proper computer science course that would help them master the programming language and produce clean, professional code: code that is easy to read, can be re-used, runs fast with reasonable memory requirements, is easy to collaborate on and, most importantly, gives reliable results.

I am no exception. During my studies we used R and Matlab to get hands-on experience with various machine learning techniques. We naturally focused on choosing the best model, tuning its parameters, dealing with violated model assumptions and other rather theoretical concepts. So when I started my professional career I had to learn how to deal with imperfect input data, how to create a script that can run daily, and how to fit the best model and store its predictions in a database, or even use them directly in some online client-facing touchpoint.

To do this I took the standard path: reading books, papers and blogs, trying new stuff while working on hobby projects, googling, stack-overflowing and asking colleagues. But again, I mainly focused on overcoming small ad hoc problems.

Luckily for me, I met a few smart computer scientists along the way who showed me how to develop code that is more professional, or at least less amateurish. What follows is a list of the most important points I have had to learn since I left university. These points allowed me to work on problems that are more complex both theoretically and technically. I must admit that improving your coding skills is a never-ending story that restarts with every new project.

1. Parameters, constants and functions

You can easily re-use your code if you make it applicable to similar problems as well. This is simple wisdom that is nevertheless quite tricky to apply in practice. Your building blocks here are parameters, constants and functions.

Parameters enable you to change important variables and settings in one place. You should never have anything hard-coded in the body of your code. Constants let you define static values that are not meant to be altered; R does not enforce this by default, so the upper-case names serve as a convention. Constants are useful, for example, when you need to compare strings.

    library(caret)
    library(futile.logger)

    #' constants
    DATASET_IRIS <- 'iris'
    DATASET_MTCARS <- 'mtcars'
    IRIS_TARGET <- 'Sepal.Length'
    MTCARS_TARGET <- 'mpg'
    MODELLING_METHOD_RF <- 'random forest'
    MODELLING_METHOD_GBM <- 'gradient boosting machine'

    #' parameters
    DATASET <- DATASET_IRIS
    MODELLING_METHOD <- MODELLING_METHOD_GBM

    #' load data
    flog.info(paste0('Loading ', DATASET, ' dataset'))
    if (DATASET == DATASET_IRIS){
      data(iris)
      df <- iris
      target_variable <- IRIS_TARGET
    } else if (DATASET == DATASET_MTCARS){
      data(mtcars)
      df <- mtcars
      target_variable <- MTCARS_TARGET
    }

    #' create formula
    modelling_formula <- as.formula(paste0(target_variable, '~.'))

    #' train model
    flog.info(paste0('Fitting ', MODELLING_METHOD))
    if (MODELLING_METHOD == MODELLING_METHOD_RF){
      set.seed(42)
      my_model <- caret::train(form=modelling_formula, data=df, method='rf')
    } else if (MODELLING_METHOD == MODELLING_METHOD_GBM){
      set.seed(42)
      my_model <- caret::train(form=modelling_formula, data=df, method='gbm', verbose=FALSE)
    }
    my_model
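As a side note on constants: R itself will happily let you reassign an upper-case variable. A minimal sketch, assuming you want R to actually refuse such changes, is to lock the binding with base R's lockBinding(). The name overwrite_result is only a hypothetical helper used for this illustration.

```r
# Constant defined by naming convention only
DATASET_IRIS <- 'iris'

# Lock the binding in the current environment so that reassignment fails
lockBinding('DATASET_IRIS', environment())

# Try to overwrite the constant and capture the error message instead of crashing
overwrite_result <- tryCatch({
  DATASET_IRIS <- 'something else'
  'overwritten'
}, error=function(e) conditionMessage(e))

print(overwrite_result)  # error message about the locked binding
print(DATASET_IRIS)      # the constant still holds its original value
```

Whether locking is worth the extra line is a matter of taste; in a small script the naming convention is often enough.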

Functions are key ingredients of programming. Always put the repetitive tasks in your code into functions. These functions should always aim to perform one task and be general enough to be used for similar cases. How general typically depends on what you want to achieve.

Even helper functions should be well documented. The absolute minimum is to summarize what the function does and what its input parameters mean. I usually use roxygen comments so that the function can later be used in an R package without much extra work. For more details please see here.

    #' Calculates Root Mean Squared Error
    #'
    #' @param observed vector with observed values
    #' @param predicted vector with predicted values
    #' @return numeric
    f_calculate_rmse <- function(observed, predicted){
      error <- observed - predicted
      return(round(sqrt(mean(error^2)), 2))
    }

You have to test the functions you are writing anyway, so it is a good idea to automate this step in case you want to update the functions in the future. This is especially important if you plan to wrap your functions in a package. A nice way to do this is the testthat package. Here is a nice page on how to run your tests automatically.

    library(testthat)
    library(Metrics)

    #' testing of f_calculate_rmse
    test_that('Root Mean Square Error', {
      #' create some data
      n <- 100
      observed <- rnorm(n)
      predicted <- rnorm(n)
      my_rmse <- f_calculate_rmse(observed=observed, predicted=predicted)

      #' same results as Metrics::rmse (tolerance covers the rounding to 2 digits)
      expect_equal(my_rmse, Metrics::rmse(actual=observed, predicted=predicted), tolerance=.05)

      #' output is numeric and non-negative
      #' (expect_true replaces the deprecated expect_that/is_a/is_true helpers)
      expect_true(is.numeric(my_rmse))
      expect_true(my_rmse >= 0)
    })

Obviously you do not need to write every function yourself. A great advantage of R is that so many functions are available in thousands of packages. To avoid namespace problems when two loaded packages both contain a function with the same name, specify the package you want to use with packagename::functionname(). An example is the summarise function when both the plyr and dplyr packages are loaded.

    library(plyr)
    library(dplyr)

    #' load data
    data(mtcars)

    #' see what happens
    summarise(group_by(mtcars, cyl), n=n())
    plyr::summarise(group_by(mtcars, cyl), n=n())
    dplyr::summarise(group_by(mtcars, cyl), n=n())

2. Style

You will be reading your code again in the future so be nice to yourself (and anyone else who will have to read it) and have a consistent coding style. A lot of people use Google’s R style or Hadley Wickham’s style.

Here I also need to stress that it is important to comment your code, especially when you consider your solution brilliant and obvious. Also, do not be afraid of long but self-explanatory function and variable names.

3. Version control

Always use version control for your projects. It will save you a lot of stress and has many advantages. Here is a nice summary of them. The most important to me are:

the ability to revert to previous versions of my code

a clean project folder, because I can delete anything without fear

an easy way to invite colleagues to collaborate on the project

Using git is easy, especially from RStudio.

4. Development

Development doesn’t necessarily need to be a messy process.

Your code is not working? Then you need to be able to quickly locate the problem and fix it. Luckily, RStudio has a lot of built-in debugging tools so that you can stop the code at the point where you suspect the problem is arising, and look at and/or walk through the code, step-by-step at that point.

    #' create some data
    set.seed(42)
    n <- 100
    observed <- rnorm(n)
    predicted <- rnorm(n)

    #' debug the function
    debug(f_calculate_rmse)
    f_calculate_rmse(observed=observed, predicted=predicted)

Each programming language has its own strengths and weaknesses that you need to keep in mind. You don’t want your code to run too slowly or use too much memory. A handy tool for this is profiling, and again RStudio comes with a solution. Profiling enables you to detect where your code spends the most time and where it uses the most memory. Do not rely on your intuition when optimizing your code! You should also check how the running time and memory requirements grow with the size of the data. This will give you an idea of what data sizes your code can handle and what the consequences for scaling might be.

    library(profvis)

    #' profiling of f_calculate_rmse
    profvis({
      set.seed(42)
      n <- 1e5
      observed <- rnorm(n)
      predicted <- rnorm(n)
      f_calculate_rmse(observed=observed, predicted=predicted)
    })
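The scaling check mentioned above can be sketched with base R's system.time(). This is only a rough illustration: the chosen sizes and the names sizes, timings and scaling are assumptions made for the example, and the function is repeated here so that the snippet runs on its own.

```r
# Function under inspection, repeated so that the snippet is self-contained
f_calculate_rmse <- function(observed, predicted){
  error <- observed - predicted
  return(round(sqrt(mean(error^2)), 2))
}

# Hypothetical input sizes to try
set.seed(42)
sizes <- c(1e4, 1e5, 1e6)

# Measure the elapsed time of a single call for each input size
timings <- sapply(sizes, function(n){
  observed <- rnorm(n)
  predicted <- rnorm(n)
  system.time(f_calculate_rmse(observed=observed, predicted=predicted))[['elapsed']]
})

# Collect the results so that the growth is easy to inspect or plot
scaling <- data.frame(n=sizes, elapsed_seconds=timings)
print(scaling)
```

A single run per size is noisy, so for a serious comparison you would repeat each measurement several times and look at the median.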

During the development you will encounter many problems. Each time this happens you should improve the error handling in your code and raise a self-explanatory warning or error. Especially mind the data types and missing parameters.
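As a minimal sketch of such error handling, the RMSE function from above could check its inputs before doing any work. The name f_calculate_rmse_checked and the exact messages are hypothetical, and this variant skips the rounding and ignores missing values, which is an assumption about the desired behaviour rather than a rule.

```r
#' Calculates Root Mean Squared Error with basic input checks
#'
#' @param observed numeric vector with observed values
#' @param predicted numeric vector with predicted values
#' @return numeric
f_calculate_rmse_checked <- function(observed, predicted){
  # missing parameters
  if (missing(observed) || missing(predicted)){
    stop('Both observed and predicted must be supplied')
  }
  # data types
  if (!is.numeric(observed) || !is.numeric(predicted)){
    stop('observed and predicted must be numeric vectors')
  }
  # matching lengths
  if (length(observed) != length(predicted)){
    stop('observed and predicted must have the same length')
  }
  # missing values are tolerated but reported
  if (anyNA(observed) || anyNA(predicted)){
    warning('Missing values found; they are ignored in the calculation')
  }
  error <- observed - predicted
  return(sqrt(mean(error^2, na.rm=TRUE)))
}

# A type problem now produces a readable message instead of a cryptic failure
result <- tryCatch(
  f_calculate_rmse_checked(observed=c('a', 'b'), predicted=c(1, 2)),
  error=function(e) conditionMessage(e)
)
print(result)
```

In a script that runs unattended, the same messages can be passed to the logger instead of being printed.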

5. Deployment

Deployment means that your code will need to run automatically, or at least without you executing it line by line. In this case it is very helpful to know what is going on and whether the execution finished without any problems. For this purpose I use the futile.logger package. It is a lightweight solution that enables me to log the execution of my code to the screen or to a file. I just need to write understandable messages in the right places in my code.

    library(futile.logger)

    #' logging setup
    flog.threshold(DEBUG)                     # level of logging
    flog.appender(appender.file('foo.log'))   # log to file

    #' logging
    flog.info('Some info message')
    flog.debug('Some debug message')
    flog.warn('Some warning message')
    flog.error('Some error message')

Automated code execution is typically done by the cron scheduler using Rscript foo.r, a command that runs the foo.r script. Very often you want to pass some parameters to the script so that you can analyse different data, specify which machine learning method to use, decide whether to retrain the model and so on. For this I use the argparse package. The following code enables me to specify the csv file with input data on the command line: Rscript my_code.r -if latest_data.csv.

    library(argparse)

    #' default parameters
    INPUT_FILE_DEFAULT <- 'input.csv'

    #' create parser object
    parser <- ArgumentParser(description='My code')

    #' define arguments
    parser$add_argument('-if', '--input_file',
                        default=INPUT_FILE_DEFAULT,
                        type='character',
                        help='Location of csv file with input data')

    #' get command line options
    args <- parser$parse_args()

    #' load data
    data <- read.csv(args$input_file)

6. Plotting

Data visualization is the “shop window” of analytics, so you will probably spend a lot of time fine-tuning each plot. A good practice is the following:

define style, color palette and any other parameters in a separate script

write a function to create a plot object

use another function to either show the plot or save it to a file

Let’s see how it works in the following basic example.

    library(ggplot2)
    data(iris)

    #' some basic style
    my_colors <- list('red'='#B22222')

    #' function to create histogram
    f_create_histogram <- function(df, column){
      #' .data[[column]] maps a column given as a string (aes_string is deprecated)
      p <- ggplot(df, aes(x=.data[[column]])) +
        geom_histogram(binwidth=.1, fill=my_colors$red) +
        xlab(column) +
        ggtitle(paste0('Histogram of ', column))
      return(p)
    }

    #' create plots
    sepal_length_hist <- f_create_histogram(iris, 'Sepal.Length')
    sepal_width_hist <- f_create_histogram(iris, 'Sepal.Width')

    #' show
    sepal_length_hist

    #' save
    ggsave('sepal_width_hist.png', plot=sepal_width_hist)

7. Reproducibility

Make sure your code is reproducible. Because a lot of data science steps involve random sampling or optimization, we need to make sure that we can re-run the code and get the same results. That is why it is critical to use the set.seed() function.

    > set.seed(42); sample(LETTERS, 5)
    [1] "X" "Z" "G" "T" "O"
    > set.seed(42); sample(LETTERS, 5)
    [1] "X" "Z" "G" "T" "O"
    > sample(LETTERS, 5)
    [1] "N" "S" "D" "P" "W"

8. Combine tools

Once you become confident in R programming you tend to do everything in R. Please do not forget that there are many other tools available and thanks to connectors they can be used together with R. For example I very often combine R with Python or SQL databases.
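As a rough illustration of combining R with a SQL database, the sketch below pushes a data frame into a database and lets SQL do the aggregation. It assumes the DBI and RSQLite packages are installed and uses a throwaway in-memory SQLite database; the table name and the query are made up for the example, and a real project would connect to its own database instead.

```r
library(DBI)

# Open a temporary in-memory SQLite database (assumption: RSQLite is installed)
con <- dbConnect(RSQLite::SQLite(), ':memory:')

# Copy an R data frame into a database table
dbWriteTable(con, 'mtcars', mtcars)

# Let the database aggregate and return the result as a data frame
cyl_summary <- dbGetQuery(
  con,
  'SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
   FROM mtcars
   GROUP BY cyl
   ORDER BY cyl'
)

# Always close the connection when you are done
dbDisconnect(con)

print(cyl_summary)
```

The same DBI calls work with other database backends; only the dbConnect() line changes.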