By Wei-Meng Lee

Wei-Meng Lee is a technologist and founder of Developer Learning Solutions (http://www.learn2develop.net), a technology company specializing in hands-on training on the latest Microsoft technologies. Wei-Meng speaks regularly at international conferences and is the author of ASP.NET 2.0: A Developer’s Notebook and Visual Basic 2005 Jumpstart (both from O'Reilly Media, Inc). Wei-Meng is currently a Microsoft Device Application Development MVP. Contact Wei-Meng at weimenglee@learn2develop.net.

In my previous article (Nov/Dec 2017 CODE Magazine), I talked about machine learning using Python and the Scikit-learn library. Besides Python, another language is very popular among data scientists and statisticians: R. R is an open-source programming language and software environment for statistical computing and graphics. R is based on another language, S, created by John Chambers while he was at Bell Labs. The name R comes partly from the first names of its two creators, Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand, and partly from the fact that it was seen as a dialect of the S language.

Regardless of the history behind its name, R and its libraries implement a wide variety of statistical techniques, such as linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. Another core strength of R is graphics, which can produce publication-quality graphs. All of these qualities make R a dream language for statisticians and data scientists.

In this article, I'll start out with an introduction to the R language so that you can get up to speed quickly. In the next article, I'll dive into the various libraries in R that you use for machine learning.

Trying Out R

To try out R, you have a number of options. First, if you followed my previous article on using Python with Scikit-learn, you installed Anaconda (https://www.anaconda.com/download/). (If you didn't already do this, please follow the “Installing Anaconda” sidebar's link.) Although the Anaconda installation only comes with Python support by default, you could easily add R support in Anaconda (https://conda.io/docs/user-guide/tasks/use-r-with-conda.html) by running a simple command in Terminal. To install R in Anaconda, type the following command in Terminal and follow the on-screen instructions:

$ conda install r-essentials

The above command installs the libraries for R in your Anaconda installation. Once this is done, you can launch Jupyter Notebook. Doing so brings up the development environment using your Web browser:

$ jupyter notebook

The above command launches the Web browser. To start an R session, click New > R (see Figure 1).



Figure 1: Create a new notebook for R.

You should now see the familiar notebook, as shown in Figure 2.



Figure 2: You are now ready for some R action!

Another popular editor for running R code is RStudio (https://www.rstudio.com). I'll be using Jupyter for this article.

Basic Language Syntax

R is a dynamically typed language, meaning that variables need not be pre-declared with a specific data type. Rather, variables take on whatever type is necessary, based on the value assigned to them. The following statements show some examples:

num1 <- 5.5
6 -> num2
print(num1)   # 5.5
print(num2)   # 6
num2 = "Two"
print(num2)   # "Two"

In R, the assignment operator is <- or -> (although the usual = operator is also supported).

To check the data type of variables, use the typeof() function:

print(typeof(num1))   # "double"
print(typeof(num2))   # "character"
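As a small sketch using made-up literal values (they don't appear elsewhere in this article), here is how typeof() reports the common basic types. Note the L suffix, which asks R for an integer rather than the default double:

```r
# Illustrative values only
print(typeof(5))      # "double"
print(typeof(5L))     # "integer"
print(typeof(TRUE))   # "logical"
print(typeof("5"))    # "character"
```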

You can also perform multiple assignments in a single statement, like this:

num2 = 6
num4 <- num3 <- num2
print(num3)   # 6
print(num4)   # 6

One common misconception when dealing with string variables is to assume that the length() function returns the length of the string, as the following example illustrates:

str = "This is a string"
print(str)           # "This is a string"
print(length(str))   # 1

Interestingly, the length() function returns a 1 for the above example. This is because the length() function returns the length of vectors (the section on Vectors later in this article covers this more fully). In R, even a single value is stored as a vector of length one. Think of a vector as an array in a conventional programming language. So, in the above example, length(str) actually returns the number of items in the str vector, which is 1. To get the length of a string variable, use the nchar() function, like this:

print(nchar(str)) # 16
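The difference between the two functions is easier to see on a vector with more than one string. In this sketch (the words vector is a made-up example), length() counts the items and nchar() counts the characters in each item:

```r
# Hypothetical vector of three strings
words <- c("R", "Python", "Scikit-learn")
print(length(words))   # 3
print(nchar(words))    # 1 6 12
```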

Using Functions in R

In R, you can get more information about a specific function by using the print() function. For example:

print(exp) # function (x) .Primitive("exp")

The above statement shows that the exp() function takes a single argument, x. The .Primitive("exp") part indicates that the function is implemented internally by R rather than in R code. Here's another example:

print(log) # function (x, base = exp(1)) .Primitive("log")

The log() function takes two arguments. The second argument, base, has a default value of exp(1), so log() computes the natural logarithm unless you specify another base. Like exp(), it's implemented internally as a primitive.

You can now see how to call the log() function using the various combinations of arguments:

print(log(10))                  # 2.302585
print(log(10, base=exp(1)))     # 2.302585
print(log(10, base=10))         # 1
print(log(10, 10))              # 1
print(log(base=exp(1), x=10))   # 2.302585
print(log(base=exp(1), 10))     # 2.302585

Note that you can swap the order of the arguments if you specify the argument names. This is very useful as it makes the function calls much more self-explanatory. There are also some scientific and mathematical functions in R (note that the trigonometric functions expect angles in radians, not degrees):

print(sin(90))            # 0.8939967
print(cos(180))           # -0.5984601
print(tan(270))           # -0.1788391
print(factorial(6))       # 720
print(round(3.14))        # 3
print(round(3.145, 2))    # 3.15
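The trigonometric results above are computed on radians, which is why sin(90) isn't 1. If your angles are in degrees, convert them first. The following is a minimal sketch; deg2rad is a hypothetical helper name, not a built-in R function:

```r
# Hypothetical helper: convert degrees to radians
deg2rad <- function(deg) {
    return (deg * pi / 180)
}

print(sin(deg2rad(90)))    # 1
print(cos(deg2rad(180)))   # -1
```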

Defining Your Own Functions

To define your own function, you can use the function keyword and then assign it to a function name, like the following:

myFunction <- function(n, m = 6) {
    result = n * m
    result + 5    # OR return (result + 5)
}

In the above example, myFunction takes two arguments: n and m. The parameter m has a default value of 6, which is used when you don't supply it in the call. Note that the value of the last expression evaluated in a function is used as the return value, so the return keyword is optional. I prefer to use the return keyword, as this makes the function clearer. The following statements show how to call the function, the first with one argument and the second with two arguments:

print(myFunction(5))     # 35
print(myFunction(5,7))   # 40
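Just as with log(), you can call your own functions with named arguments in any order. This sketch repeats the definition of myFunction (using return) so that it runs on its own; the argument values are arbitrary:

```r
myFunction <- function(n, m = 6) {
    result <- n * m
    return (result + 5)
}

print(myFunction(m = 2, n = 3))   # 11, same as myFunction(3, 2)
print(myFunction(n = 5))          # 35, m falls back to its default of 6
```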

Making Decisions

To make decisions, R uses the familiar if-else statement construct. The following shows an example of a function that determines if a number is an odd number:

isodd <- function(n) {
    if (n %% 2 == 0) {   # %% is modulus
        FALSE
    } else {
        TRUE
    }
}
isodd(46)   # FALSE
isodd(45)   # TRUE

If you're one who indulges in terse coding, the above isodd() function can be rewritten as a single statement:

isodd <- function(n) {(n %% 2 != 0)}

R supports the usual arithmetic, relational, and logical operators:

+ (addition), - (subtraction), * (multiplication), / (division), ^ (power), %% (modulo)

> (greater than), < (less than), == (equality), <= (less than or equal to), >= (greater than or equal to), != (not equal to)

&& (logical AND), || (logical OR), ! (logical NOT)
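R also has single-character forms of the logical operators, & and |, which work element by element on whole vectors, whereas && and || are meant for single TRUE/FALSE values such as the condition of an if statement. A minimal sketch with made-up vectors a and b:

```r
# Hypothetical logical vectors
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

print(a & b)           # TRUE FALSE FALSE
print(a | b)           # TRUE TRUE TRUE
print(!a)              # FALSE TRUE FALSE
print(TRUE && FALSE)   # FALSE
```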

Vectors

As briefly mentioned earlier, everything in R is a vector. Think of a vector as an array of items of the same data type.

x <- c(3,4,5,6,7)
print(typeof(x))   # double
print(length(x))   # 5

In that snippet, x is a vector comprised of five items of type double. You can append additional items into the vector using the c() function:

x <- c(x, 8)   # append another item to x
print(x)       # 3 4 5 6 7 8

If you append an item of a different type to the existing vector, R attempts to convert all items in the vector into a common type, as the following example shows:

x <- c(x, "9")    # append another item to x
print(x)          # "3" "4" "5" "6" "7" "8" "9"
x <- c(x, TRUE)
print(x)          # "3" "4" "5" "6" "7" "8" "9" "TRUE"

If you want to have a collection of items with different types, use a list instead of a vector, like this:

y <- list(3,4,5,6,7,"9",TRUE)
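Unlike a vector, a list keeps the original type of each item. As a sketch (using a shorter list than the one above), you can use double square brackets to pull out a single item and check its type:

```r
y <- list(3, 4, 5, "9", TRUE)

print(length(y))         # 5
print(typeof(y[[1]]))    # "double"
print(typeof(y[[4]]))    # "character"
print(typeof(y[[5]]))    # "logical"
```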

Vector Functions

In R, there are a number of vector functions that make manipulating numbers easy. Here's an example:

nums <- c(12,34,56,9,45,67,90,11,2,45)
print(min(nums))      # 2
print(max(nums))      # 90
print(mean(nums))     # 37.1
print(median(nums))   # 39.5
print(sd(nums))       # 28.90002
print(sort(nums))     # 2 9 11 12 34 45 45 56 67 90
print(sum(nums))      # 371
print(unique(nums))   # 12 34 56 9 45 67 90 11 2

As you can see, given a vector containing numbers, it's very easy to get information from the vector using the various vector functions such as min(), max(), etc. In particular for data science, you can use the summary() function to get a summary of the numbers contained in the vector:

print(summary(nums))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    2.00   11.25   39.50   37.10   53.25   90.00
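If you need only some of these figures, for example the quartiles, you can compute them directly. The following sketch uses the built-in quantile() and IQR() functions on the same numbers:

```r
nums <- c(12,34,56,9,45,67,90,11,2,45)

q <- quantile(nums, c(0.25, 0.75))
print(q)           # 11.25 53.25 (labeled 25% and 75%)
print(IQR(nums))   # 42, the distance between the two quartiles
```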

Dealing with NAs

A lot of times, you load data from files, in particular, CSV or tab-separated files. These data files may contain “holes” in them, meaning missing data for some rows and columns. When loading missing values from a CSV file, R automatically replaces them with NAs (for Not Available). One good use of the summary() function is to show the number of NAs in a vector so that you can decide if you need to replace or omit them from the vector before you do any further processing. The following code example shows the summary() function displaying the number of NAs in the vector:

nums <- c(12,34,56,9,45,67,90,11,2,45, NA)
print(summary(nums))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#    2.00   11.25   39.50   37.10   53.25   90.00       1

To omit the NAs in your vector, use the na.omit() function:

print(sum(na.omit(nums)))   # omit the NAs in the vector and sum up the rest

If you want to know whether each element in the vector is an NA, use the is.na() function:

print(is.na(nums))
# FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

To extract all the items in a vector that aren't NAs, use the subset() function:

nums <- subset(nums, is.na(nums) == FALSE)
print(nums)   # 12 34 56 9 45 67 90 11 2 45
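It's worth seeing why NAs need handling at all: most arithmetic on a vector containing an NA yields NA. Many of the vector functions accept an na.rm argument that skips the NAs without modifying the vector. A minimal sketch using the same numbers:

```r
nums <- c(12,34,56,9,45,67,90,11,2,45, NA)

print(sum(nums))                   # NA
print(sum(nums, na.rm = TRUE))     # 371
print(mean(nums, na.rm = TRUE))    # 37.1
print(sum(is.na(nums)))            # 1, the number of NAs
```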

Sequencing

When performing data science operations, you often need to generate a sequence of numbers. Instead of creating a vector of numbers manually, it would be easier to be able to generate the sequence automatically. The following code snippet generates a sequence from 1 to 3:

i <- 1:3   # generate a sequence from 1 to 3
print(i)   # 1 2 3

You can also generate a sequence in the reverse order:

j <- 5:1
print(j)   # 5 4 3 2 1

You can also use the seq() function to generate a sequence of numbers:

y <- seq(9)
print(y)   # 1 2 3 4 5 6 7 8 9

The seq() function is useful when you want to specify an increment for the sequence, like the following:

y <- seq(from=2, to=9, by=2)
print(y)   # 2 4 6 8

You can also specify the length of the sequence as well as the starting and ending number and the seq() function automatically divides the numbers equally:

y <- seq(from=1, to=1.9, length=20)
print(y)
#  [1] 1.000000 1.047368 1.094737 1.142105 1.189474 1.236842 1.284211 1.331579
#  [9] 1.378947 1.426316 1.473684 1.521053 1.568421 1.615789 1.663158 1.710526
# [17] 1.757895 1.805263 1.852632 1.900000
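The spacing in that output follows from simple arithmetic: 20 numbers have 19 gaps between them, so each step is (1.9 - 1) / 19. The sketch below checks this; it spells the argument out as length.out, the full name that length abbreviates:

```r
y <- seq(from = 1, to = 1.9, length.out = 20)
step <- (1.9 - 1) / (20 - 1)

print(step)        # 0.04736842
print(length(y))   # 20
# every gap between neighbouring items equals the step
print(all.equal(diff(y), rep(step, 19)))   # TRUE
```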

If you want to generate a sequence of identical numbers, use the rep() (for repeat) function:

zeros <- rep(0, time=20) print(zeros) # 0 0 0 0 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0 0 0

The rep() function can also be used to repeat a vector, as the following demonstrates:

cars <- c("Suzuki", "Toyota", "Mercedes")
manycars <- rep(cars, each=2)
print(manycars)
# "Suzuki" "Suzuki" "Toyota" "Toyota" "Mercedes" "Mercedes"
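The each argument repeats every item in place. For comparison, the times argument repeats the vector as a whole, as this short sketch shows:

```r
cars <- c("Suzuki", "Toyota", "Mercedes")

print(rep(cars, times = 2))
# "Suzuki" "Toyota" "Mercedes" "Suzuki" "Toyota" "Mercedes"
```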

Vector Indexing

You can use a sequence as an index into a vector to retrieve the items you want, as the following shows:

x <- c(3,4,5,6,7)
i <- 1:3
print(x[i])   # 3 4 5

To omit an item in a vector, specify the index of the item to omit and prefix it with a negative sign, like this:

print(x[-1])           # 4 5 6 7  excludes the first item
print(x[-2])           # 3 5 6 7  excludes the second item
print(x[-length(x)])   # 3 4 5 6  excludes the last item

To get the last n items from a vector, use the tail() function, like this:

print(tail(x,1))   # 7
print(tail(x,2))   # 6 7

You can also specify a range of items to extract from a vector, like this:

print(x[-2:-3])              # 3 6 7    excludes second through third items
print(x[1:(length(x)-1)])    # 3 4 5 6  excludes the last item
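One detail to watch when you build a range from an expression: the : operator binds more tightly than subtraction. An expression such as 1:length(x)-1 is therefore read as (1:length(x)) - 1, which produces 0 through 4. Indexing with it still returns the first four items, but only because index 0 selects nothing. Parentheses make the intent explicit, as this sketch shows:

```r
x <- c(3,4,5,6,7)

print(1:length(x) - 1)          # 0 1 2 3 4
print(1:(length(x) - 1))        # 1 2 3 4
print(x[1:(length(x) - 1)])     # 3 4 5 6
```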

You can also specify conditions, like this:

print(x[x > 5])         # 6 7
print(x[x %% 2 == 0])   # 4 6
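Conditions work as indexes because a comparison on a vector produces a logical vector of the same length, and R keeps the items where it is TRUE. The sketch below shows that intermediate result, uses the built-in which() function to get the matching positions, and combines two conditions with the element-wise & operator:

```r
x <- c(3,4,5,6,7)

print(x > 5)                    # FALSE FALSE FALSE TRUE TRUE
print(which(x > 5))             # 4 5 (positions, not values)
print(x[x > 3 & x %% 2 == 1])   # 5 7
```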

Looping

R supports looping constructs that are commonly found in other programming languages. The following example shows the for loop in action:

# function to print the first n numbers of the fibonacci numbers
fib <- function(n) {
    x <- c(1,1)
    for (i in 2:(n-1)) {
        x <- c(x, sum(tail(x,2)))
    }
    return (x)
}
print(fib(8))   # 1 1 2 3 5 8 13 21
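The 2:(n-1) range in that loop counts downward when n is 2 or less (2:1 is the sequence 2, 1), so the loop body still runs for small values of n. The following is one possible variation, not the only fix; fib_safe is a hypothetical name, and it relies on the built-in seq_len() function, which returns an empty sequence when given 0:

```r
# Hypothetical variant that also handles n = 1 and n = 2
fib_safe <- function(n) {
    x <- c(1,1)
    # seq_len(0) is empty, so the loop is skipped for small n
    for (i in seq_len(max(n - 2, 0))) {
        x <- c(x, sum(tail(x,2)))
    }
    return (head(x, n))
}

print(fib_safe(8))   # 1 1 2 3 5 8 13 21
print(fib_safe(2))   # 1 1
print(fib_safe(1))   # 1
```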

You can also use the while loop in R; in the following example, I used it to create a Fibonacci sequence up to the number specified:

# function to print the fibonacci sequence up to the number n specified
fib <- function(n) {
    x <- c(1,1)
    while (tail(x,1) < n) {
        x <- c(x, sum(tail(x,2)))
    }
    return (x)
}
print(fib(13))   # 1 1 2 3 5 8 13

Data Frames

A lot of times, data is represented in tabular format. This is something that R excels in. Data frames are extremely useful in data science operations, as data is often stored in CSV files or Excel spreadsheets. Loading the data into data frames allows you to manipulate the data using rows and columns.

Creating a Data Frame from Vectors

The following code snippet shows how a data frame (think of it as a table) is created from three vectors:

# column 1
c1 = c(2, 3, 5)
# column 2
c2 = c("aaa", "bbb", "ccc")
# column 3
c3 = c(TRUE, FALSE, TRUE)
# create a data frame using the 3 columns
df = data.frame(c1,c2,c3)
print(df)

Printing the data frame produces the following output:

  c1  c2    c3
1  2 aaa  TRUE
2  3 bbb FALSE
3  5 ccc  TRUE

Changing the Column Names of a Data Frame

Note that the column names take on the names of the three vectors by default. You can change this by specifying the column name explicitly:

df = data.frame(col1 = c1, col2 = c2, col3 = c3)
print(df)

The above changes produce the following output with the new column names:

  col1 col2  col3
1    2  aaa  TRUE
2    3  bbb FALSE
3    5  ccc  TRUE
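You can also rename the columns of a data frame that already exists by assigning to names(). The sketch below rebuilds the same three-column data frame so that it runs on its own, and uses nrow() and ncol() to confirm its shape:

```r
df <- data.frame(c1 = c(2, 3, 5),
                 c2 = c("aaa", "bbb", "ccc"),
                 c3 = c(TRUE, FALSE, TRUE))

names(df) <- c("col1", "col2", "col3")
print(names(df))   # "col1" "col2" "col3"
print(nrow(df))    # 3
print(ncol(df))    # 3
```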

Extracting Columns

To print out a specific column, use the index of the column (remember, indexes in R start at 1, not 0):

print(df[2]) # second column

The above prints out this:

  col2
1  aaa
2  bbb
3  ccc

You can also print out a column using its column name, like this:

print(df["col3"])
#    col3
# 1  TRUE
# 2 FALSE
# 3  TRUE

Note that the preceding prints out a data frame containing only the third column. If you want the values of the third column as a vector, you can use the following syntax:

print(df$col3)   # same as print(df[,"col3"])
# [1]  TRUE FALSE  TRUE

Extracting Rows

To print out a specific row in a data frame, specify its row index:

print(df[2,])
#   col1 col2  col3
# 2    3  bbb FALSE

To print out a specific item in a data frame, specify its row index and column index:

print(df[2,3])
# [1] FALSE

Transposing a Data Frame

You can also transpose a data frame using the t() function:

print(t(df))
#      [,1]    [,2]    [,3]
# col1 "2"     "3"     "5"
# col2 "aaa"   "bbb"   "ccc"
# col3 " TRUE" "FALSE" " TRUE"

The t() function converts rows to columns and columns to rows.
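Notice that every value in the transposed output is quoted. The t() function returns a matrix, and a matrix can hold only one type, so mixed columns are converted to character. The following sketch, using two small made-up data frames, shows that an all-numeric data frame keeps its numbers:

```r
# Hypothetical data frames for illustration
nums_df <- data.frame(a = c(1, 2), b = c(3, 4))
mixed   <- data.frame(n = c(2, 3), s = c("aaa", "bbb"))

m <- t(nums_df)
print(is.matrix(m))       # TRUE
print(dim(m))             # 2 2
print(typeof(m))          # "double"
print(typeof(t(mixed)))   # "character"
```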

Subsetting Data Frames

To create a subset of a data frame based on certain criteria, use the subset() function:

true_col = subset(df, col3 == TRUE)
print(true_col)
#   col1 col2 col3
# 1    2  aaa TRUE
# 3    5  ccc TRUE

The previous snippet retrieves a subset of the df data frame based on the value of the “col3” column. The next snippet also uses the select argument to keep only the specified columns in the result:

true_col = subset(df, col3 == TRUE, select = c(col1, col2))
print(true_col)
#   col1 col2
# 1    2  aaa
# 3    5  ccc
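The same filtering can be written with the square-bracket indexing shown earlier: a logical condition in the row position and a vector of column names in the column position. This sketch rebuilds the data frame so that it runs on its own and compares the two approaches (by_subset and by_bracket are illustrative names):

```r
df <- data.frame(col1 = c(2, 3, 5),
                 col2 = c("aaa", "bbb", "ccc"),
                 col3 = c(TRUE, FALSE, TRUE))

by_subset  <- subset(df, col3 == TRUE, select = c(col1, col2))
by_bracket <- df[df$col3 == TRUE, c("col1", "col2")]

print(by_bracket)                          # rows 1 and 3, columns col1 and col2
print(all.equal(by_subset, by_bracket))    # TRUE
```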

Creating a Data Frame from Files

For data science work, most of the time you create a data frame directly from a file, such as a CSV, or tab-separated file. Suppose you have the following content saved as a file named fruits.csv:

orange,pineapple,durian
2,3,4
4,5,2
5,3,1
3,2,5
6,8,10

The following code snippet reads the above CSV file into an R data frame and prints it out:

fruits <- read.csv(file = "fruits.csv")
print(fruits)
#   orange pineapple durian
# 1      2         3      4
# 2      4         5      2
# 3      5         3      1
# 4      3         2      5
# 5      6         8     10
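If you'd like to experiment without creating a file first, read.csv() also accepts the CSV content as a string through its text argument. The following sketch loads the same fruit data that way (csv_text is an illustrative variable name) and totals each column with colSums():

```r
# The same content as fruits.csv, held in a string
csv_text <- "orange,pineapple,durian
2,3,4
4,5,2
5,3,1
3,2,5
6,8,10"

fruits <- read.csv(text = csv_text)
print(dim(fruits))       # 5 3 (rows, columns)
print(colSums(fruits))   # orange 20, pineapple 21, durian 22
```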

Sometimes the CSV file isn't stored locally but resides on a Web server. In this case, you can download it as a file before reading it into a data frame:

download.file(url = "http://bit.ly/2iJjdpb", destfile = "crimerecords.csv")
crimerecords <- read.csv(file = "crimerecords.csv")
head(crimerecords)   # show the first six rows

The preceding prints out the output, as shown in Figure 3. The head() function returns the first n rows of the data frame, which by default is the first six rows. Likewise, if you want to print the last n rows of the data frame, use the tail() function.



Figure 3: The output of the data frame loaded from a URL

Plotting Charts Using Data Frame

One of the key features of R is its strong graphic capabilities. Using R, you can directly plot graphs and charts. In the following sections, I'll discuss how to plot some interesting charts in R.

Plotting Bar Charts

Suppose you have a CSV file named public_transport.csv containing the following content:

Gender,Age,Mode,Times
male,25,bus,20
female,28,train,10
male,35,bicycle,20
male,23,bus,7
female,43,bus,24
female,19,train,16
male,41,bus,28
female,12,bicycle,10
male,32,bus,19
female,29,train,19
female,11,bus,7
male,22,train,8
female,26,train,23
female,27,train,31
male,37,train,32
male,31,train,22

That CSV file contains a listing of commuters and their age, their mode of transport and how many times they used the specified mode of transport every month. The following code snippet first loads the CSV file into a data frame:

freq <- read.csv(file = "public_transport.csv")

Suppose you want to know the number of people using bicycles, buses, and trains. To do this, you can use the table() function, which will help you tabulate the frequencies of each occurrence of transport mode:

# count the total occurrences of each "Mode" of transportation
freqMode <- table(freq$Mode)
print(freqMode)
# bicycle     bus   train
#       2       6       8
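Counts are often easier to compare as proportions. As a sketch, the built-in prop.table() function converts the counts into fractions of the total, and which.max() finds the most common mode. The modes vector below simply retypes the Mode column of the CSV file so that the example runs without the file:

```r
# The Mode column from public_transport.csv, typed in by hand
modes <- c("bus", "train", "bicycle", "bus", "bus", "train", "bus", "bicycle",
           "bus", "train", "bus", "train", "train", "train", "train", "train")

freqMode <- table(modes)
print(prop.table(freqMode))                   # bicycle 0.125, bus 0.375, train 0.5
print(names(freqMode)[which.max(freqMode)])   # "train"
```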

Using the result returned by the table() function, you can plot a bar chart using the barplot() function:

barplot(freqMode,
        main="Main Mode of Transportation",
        xlab="Modes",
        ylab="Total")

The main argument specifies the title of the chart, while the xlab and ylab arguments specify the x-axis and y-axis labels respectively. The bar chart created is shown in Figure 4.



Figure 4: Plot a bar chart showing the use of the various modes of transport.
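If you want the chart as a file rather than on screen, you can wrap the plotting call between a graphics device such as pdf() and dev.off(). The following is a minimal sketch under a few assumptions: the counts are retyped from the table() output above, and the file name transport_modes.pdf and the page size are arbitrary choices:

```r
# Counts taken from the table() output shown earlier
freqMode <- table(c(rep("bicycle", 2), rep("bus", 6), rep("train", 8)))

# Hypothetical output location in R's temporary directory
out_file <- file.path(tempdir(), "transport_modes.pdf")

pdf(out_file, width = 6, height = 4)   # open a PDF device (size in inches)
barplot(freqMode,
        main = "Main Mode of Transportation",
        xlab = "Modes",
        ylab = "Total")
invisible(dev.off())                   # close the device to finish the file

print(file.exists(out_file))           # TRUE
```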

You can also alter the density of the bars by specifying the density argument:

barplot(freqMode,
        main="Main Mode of Transportation",
        xlab="Modes",
        ylab="Total",
        border="blue",
        density=c(10,20,30,40,50))

The density argument is a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The bar chart now looks like Figure 5.



Figure 5: Change the shading of the bars

Besides loading a data frame from a CSV file, you can also load it from a tab-separated file, such as the following example of a file named fruits.txt that contains the sales of the various fruits for each month from January to May:

orange	pineapple	durian
2	3	4
4	5	2
5	3	1
3	2	5
6	8	10

The following code snippet loads the content of the file into a data frame:

fruits_data <- read.table("fruits.txt", header=T, sep="\t")

Next, plot a bar chart showing the sales of oranges for each of the five months:

barplot(fruits_data$orange,
        main="Sales of Oranges",
        xlab="Months",
        ylab="Total",
        names.arg=c("Jan","Feb","Mar","Apr","May"))

The names.arg argument is a vector of names to be plotted below each bar or group of bars. The bar chart is shown in Figure 6.



Figure 6: Plot the sales of oranges for each month

You can change the colors of the bars (see Figure 7) by using the col argument:

fruits_data <- read.table("fruits.txt", header=T, sep="\t")
barplot(fruits_data$orange,
        main="Sales of Oranges",
        xlab="Months",
        ylab="Total",
        names.arg=c("Jan","Feb","Mar","Apr","May"),
        col=rainbow(5))

The col argument is a vector of colors for the bars or bar components. You can use the rainbow() function to generate a set of colors. For example, the rainbow(5) statement generates a vector containing the following five elements: “#FF0000FF”, “#CCFF00FF”, “#00FF66FF”, “#0066FFFF”, and “#CC00FFFF”.



Figure 7: Change the colors of the bars.

What about displaying the sales of the various fruits for each month? You can first convert the data frame into a matrix using the as.matrix() function:

print(as.matrix(fruits_data))
#      orange pineapple durian
# [1,]      2         3      4
# [2,]      4         5      2
# [3,]      5         3      1
# [4,]      3         2      5
# [5,]      6         8     10

And then plot the bar chart using the matrix:

fruits_data <- read.table("fruits.txt", header=T, sep="\t")
barplot(as.matrix(fruits_data),
        main="Sales of Fruits from Jan to May",
        xlab="Months",
        ylab="Total",
        beside=TRUE,
        col=rainbow(5))

The beside argument specifies whether the columns of height are portrayed as stacked bars (FALSE), or the columns are portrayed as juxtaposed bars (TRUE). Figure 8 shows the bar chart with the beside argument set to TRUE.



Figure 8: Plot the sales of all the fruits for each month.

Figure 9 shows the bar chart with the beside argument set to FALSE.



Figure 9: Stack the bars of each month for each of the fruits.

You can display a legend on your chart using the legend() function:

fruits_data <- read.table("fruits.txt", header=T, sep="\t")
barplot(as.matrix(fruits_data),
        main="Sales of Fruits from Jan to May",
        xlab="Fruits",
        ylab="Total",
        beside=TRUE,
        col=rainbow(5))
legend("topleft",
       c("Jan","Feb","Mar","Apr","May"),
       cex=1.6,
       bty="n",
       fill=rainbow(5))

The cex argument specifies the size of the text to be used for the legend. The bty argument takes either “n” or “o”. Setting it to “o” draws a rectangle around the legend, and setting it to “n” means no rectangle is drawn.

Figure 10 shows the chart with the legend.



Figure 10: Display a legend with the bar chart

Plotting Histograms

A histogram is another type of chart that's very useful for showing the distribution of numerical data. Using the transportation CSV file that I discussed earlier, you can plot the distribution of the commuters' age using a histogram, as shown in the following code snippet:

freq <- read.csv(file = "public_transport.csv")
histogram = hist(freq$Age,
                 main ="Distribution of Age groups",
                 xlab = "Age Range",
                 ylab = "Total")

Figure 11 shows the histogram showing the distribution of the ages of the commuters.



Figure 11: Use a histogram to show the distribution of age among the commuters

Observe that R automatically breaks the data up into intervals. You can verify this by printing the breaks property:

print(histogram$breaks)
# [1] 10 15 20 25 30 35 40 45

Sometimes you want more control over the breaks. You can get it by passing a sequence to the breaks argument, like this:

histogram = hist(freq$Age,
                 main ="Distribution of Age groups",
                 xlab = "Age Range",
                 ylab = "Total",
                 breaks = seq(0,50, by=10))

The updated histogram looks like Figure 12.



Figure 12: Alter the breaks in the histogram
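The object returned by hist() also holds the number of items that fall into each interval. As a sketch, passing plot = FALSE computes the intervals and counts without drawing anything. The ages vector below retypes the Age column of the CSV file so that the example runs without the file:

```r
# The Age column from public_transport.csv, typed in by hand
ages <- c(25, 28, 35, 23, 43, 19, 41, 12, 32, 29, 11, 22, 26, 27, 37, 31)

h <- hist(ages, breaks = seq(0, 50, by = 10), plot = FALSE)
print(h$breaks)   # 0 10 20 30 40 50
print(h$counts)   # 0 3 7 4 2
```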

Plotting Scatter Plots

Scatter plots are useful for plotting data points on a horizontal and a vertical axis when attempting to show how much one variable is affected by another. Consider the following example CSV file, named rainfall.csv, containing the yield of a particular crop and the associated rainfall and average temperature for a particular year.

year,yield,rainfall,temperature
1963,60,8,56
1964,50,10,47
1965,70,11,53
1966,70,10,53
1967,80,9,56
1968,50,9,47
1969,60,12,44
1970,40,11,44

You could plot a scatter plot using the plot() function:

rainfall <- read.csv(file = "rainfall.csv")
plot(rainfall[1:4])

Figure 13 shows the scatter plot.



Figure 13: Show the relationships among all the variables using a scatterplot.

Using the scatter plot, you can compare the relationships among the various factors, such as year, yield, rainfall, and temperature. Figure 14 shows how to read the scatter plot.



Figure 14: How to read a scatterplot
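A scatter plot shows a relationship visually; to put a number on it, you can use the built-in cor() function, which returns the correlation coefficient between two variables (a value between -1 and 1). The following sketch retypes the contents of rainfall.csv into a data frame so that it runs without the file:

```r
# The contents of rainfall.csv, typed in by hand
rainfall <- data.frame(
    year        = 1963:1970,
    yield       = c(60, 50, 70, 70, 80, 50, 60, 40),
    rainfall    = c(8, 10, 11, 10, 9, 9, 12, 11),
    temperature = c(56, 47, 53, 53, 56, 47, 44, 44))

print(cor(rainfall$yield, rainfall$temperature))   # about 0.775
print(cor(rainfall$yield, rainfall$rainfall))      # about -0.167
```

For this small sample, yield rises fairly consistently with temperature, while its relationship with rainfall is weak.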

Plotting Pie Charts

A pie chart displays a circle divided into slices to illustrate numerical proportion. In R, you can display a pie chart using the pie() function. Consider the following code snippet, which has a vector containing the market share of operating systems:

os <- c(63.99, 32.03, 1.48, 1.14, 0.84, 0.51)
pie(os)

The pie chart created is shown in Figure 15.



Figure 15: Displaying a pie chart

You could supply additional information to make the pie chart more descriptive by adding labels:

pie(os,
    main="Mobile OS",
    col=rainbow(length(os)),
    labels=c("Android","iOS","Windows Phone",
             "Java ME","Symbian", "Others"))

Figure 16 shows the pie chart that is color-coded, with labels representing each slice of the pie.



Figure 16: Color-code the pie chart

As you saw earlier, you can generate a vector of colors using the rainbow() function. In addition to this function, you can use the various other color palettes in R:

heat.colors(n)

terrain.colors(n)

topo.colors(n)

cm.colors(n)

The following code snippet shows a more detailed pie chart (see Figure 17) displaying the percentage of each slice as well as displaying a legend:

colors <- terrain.colors(6)
# calculate the percentage
os_labels <- round(os/sum(os) * 100, 1)
# concat a % after each value
os_labels <- paste(os_labels, "%", sep="")
pie(os,
    main="Mobile OS",
    col=colors,
    labels=os_labels,
    cex=0.8)
legend(-1.1, 1,
       c("Android","iOS","Windows Phone",
         "Java ME","Symbian", "Others"),
       cex=0.8,
       fill=colors)



Figure 17: Show a legend for the pie chart and percentages for each slice
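One thing to be aware of with the round() and paste() approach is that whole numbers lose their decimal place, so the two largest slices are labeled 64% and 32% rather than 64.0% and 32.0%. If you prefer a fixed number of decimals, sprintf() is one alternative, sketched here with the same market-share figures (pct is an illustrative variable name):

```r
os <- c(63.99, 32.03, 1.48, 1.14, 0.84, 0.51)
pct <- os / sum(os) * 100

print(paste(round(pct, 1), "%", sep=""))
# "64%" "32%" "1.5%" "1.1%" "0.8%" "0.5%"

print(sprintf("%.1f%%", pct))   # %% prints a literal percent sign
# "64.0%" "32.0%" "1.5%" "1.1%" "0.8%" "0.5%"
```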

Besides the palette of colors, you can also create your own sets of grey tones:

colors <- c("white","grey70","grey90", "grey50","grey75","black")

Figure 18 shows the pie chart using different shades of grey.



Figure 18: Display the slices using shades of grey

Summary

In this article, you had a whirlwind tour of R. Although this isn't an attempt to teach you everything about R, I do believe getting acquainted with the language can make you ready to embrace machine learning. In addition to the language basics, you also learned how to visualize data by using the various functions in R to plot charts, such as pie charts, bar charts, histograms, and scatterplots.