I have been talking about how great dplyr is when it comes to everyday data analysis.

This holds true even when it comes to working with date and time data. Not only is dplyr great, but there is also another package called ‘lubridate’ that is designed to make it ridiculously easy and simple to work with date and time data within the dplyr framework.

Today, I’m going to use stock price data, which I extracted from Yahoo Finance with the quantmod package, to demonstrate how easy and powerful dplyr and lubridate are for everyday analysis of time series data. I’m using Exploratory Desktop, but you will find an R script at the end of this post that reproduces all the data wrangling steps.

Import Data

I have written before about how you can get stock price data with the quantmod package. I followed the same steps to get the stock price data for Apple, Amazon, Google, Facebook, Microsoft, and Twitter. I have shared this data, along with all the steps I’m going to demonstrate in this post, as an EDF (Exploratory Data Format) file so you can download it, import it into Exploratory Desktop, and see how the data is prepared.

Anyway, after a quick clean-up of the data from Yahoo Finance through the ‘quantmod’ package, I have data that looks like this.

As you can see, there are historical stock prices and trading volume data for each company. We are going to use the ‘Adjusted’ column, which holds prices adjusted for stock split events, as the stock price for this demo.

Visualize with Line Chart

We can quickly visualize this in Chart view like below.

However, there is a slight problem. The ‘date’ column is actually of ‘character’ type in this data, so I can’t change the date/time aggregation level to Year, Month, Quarter, etc., in Chart view.

1. Convert Text to Date

Luckily, I can quickly convert this column to the Date data type with one of the text-to-date conversion functions from the ‘lubridate’ package, ‘ymd’. It parses the text on the assumption that it should be read in ‘year-month-day’ order, which is the case for this data, like below.

If the data happens to look like ‘01-13-2007’ (Month-Day-Year), you can use the ‘mdy’ function instead. If it looks like ‘01-13-2007 12:25’, there is the ‘mdy_hm’ function for that! ;) Does your data look like ‘January, 13th in 2007’, with some extra text in between? You can still use ‘mdy’, which ignores those extra letters. Yes, it’s that simple! ;)
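A minimal sketch of these parser functions (the date strings here are made-up examples):

```r
library(lubridate)

ymd("2007-01-13")           # year-month-day order
mdy("01-13-2007")           # month-day-year order
mdy_hm("01-13-2007 12:25")  # month-day-year plus hours and minutes
```

All three parse to the same calendar date (the last one also carries the time of day).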

Anyway, selecting ‘Year, Month, Day’ from the column header menu in Table view will generate a command like below.

mutate(date = ymd(date))

Once you hit the ‘Run’ button, the column will be registered as Date. Now you can go back to Chart view and switch the date aggregation level to, for example, Week. Make sure you also switch the aggregation function to Average (Mean), otherwise all the values would be added up for each week, which we don’t want this time.

It looks like the stock prices have moved a lot, especially over the last few years. Let’s dig in more.

2. Last N Days, Months, Years of Data

Let’s say we want to see only the last 2 years of data, instead of the whole data set. You can select ‘N years ago’ from the column header drop-down menu in Table (or Summary) view.

This will generate a command like below.

filter(date >= today() - years(1))

There are two super convenient functions from the ‘lubridate’ package inside this ‘filter’ command. The first is ‘today’, which literally returns today’s date as a Date value. The second is ‘years’, which returns a period of the given number of years. We can change the number inside the ‘years’ function from 1 to 2, like below, to get the last 2 years of data.

filter(date >= today() - years(2))

After hitting the ‘Run’ button and going back to Chart view, we can see only the last 2 years of data, like below.

Note that I have used the ‘today’ function because the underlying data I’m working with is of the Date data type. If it were POSIXct, R’s date-and-time data type, I would have had to use the ‘now’ function instead, which returns the current date and time in POSIXct format.
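To see the difference, here is a quick sketch; ‘today’ returns a Date while ‘now’ returns a POSIXct, and periods like ‘years’ can be subtracted from either:

```r
library(lubridate)

today()              # today's date as a Date value
now()                # current date and time as a POSIXct value

today() - years(2)   # a Date two years back
now() - years(2)     # a POSIXct two years back
```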

3. Compared to the Beginning of the Period

Now, as you can see in the chart above, Amazon’s and Google’s stock prices are much larger than the other companies’, which makes it hard to compare the performance of all the companies together. One way to address this is to set a baseline. Let’s say we are interested in how these stocks have performed since the beginning of this period. We want to subtract each company’s price at the beginning of the period from all of that company’s stock prices.

This is where window functions come in handy. We are going to use ‘first’ from dplyr, which returns the first value of a given column within each group. Let’s take a look at how to do this step by step.

First, we set the grouping level to the companies (symbol), because we want the first value of each company, not of the entire data.

group_by(symbol)

Then, we make sure the data is sorted by the date column, because we want the first value to be the one at the beginning of the time period.

arrange(date)

Now, we can use the ‘first’ function to get the first ‘adjusted’ stock price for each company and subtract it from each stock price.

mutate(growth = Adjusted - first(Adjusted))

After assigning this newly calculated ‘growth’ column to the Y-Axis, we get a chart like this.

But these values are still absolute stock price differences. Now we can calculate the percent change since the beginning based on these absolute changes, like below.

mutate(growth = Adjusted - first(Adjusted), growth_percent = growth/first(Adjusted) * 100)

Note that I’m using the ‘growth’ column that was just created within the same ‘mutate’ command. Yes, that’s something you can do with dplyr! ;)
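The same grouped calculation can be sketched on a tiny made-up data frame (the symbols and prices here are illustrative, not the real stock data):

```r
library(dplyr)

prices <- tibble::tibble(
  symbol   = c("A", "A", "A", "B", "B", "B"),
  date     = rep(as.Date("2017-01-01") + 0:2, 2),
  Adjusted = c(100, 110, 120, 50, 45, 55)
)

result <- prices %>%
  group_by(symbol) %>%
  arrange(date) %>%
  mutate(growth = Adjusted - first(Adjusted),
         growth_percent = growth / first(Adjusted) * 100)

# Company A grows 0%, 10%, 20%; company B moves 0%, -10%, +10%.
```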

After we switch the Y-Axis column to this newly calculated ‘growth_percent’ column, we get a chart like below.

We can clearly see that Amazon is the big winner while Twitter is the big loser, although this is based on values compared to 2 years ago today. It might not matter much if, for example, you invested in these companies at different times. You buy low and sell high, right? ;)

Calculate the difference from a particular date

Now, here is one funky thing you can do with R. There is something called a ‘predicate’, which might sound familiar to you if you are coming from other programming languages. Basically, it lets you pick values from a column based on a condition. For example, let’s say you invested in some of these stocks on 5/15/2015, and you are only interested in how much these companies’ stocks have grown since that particular date. In this case, you want to get the stock prices on 5/15/2015 for all these companies first, then calculate the growth just like we did above. Let’s go step by step.
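The condition-based picking is easiest to see on a plain vector first (toy values, not the real data):

```r
dates  <- as.Date(c("2015-05-14", "2015-05-15", "2015-05-16"))
prices <- c(10, 12, 15)

# Pick the price whose date satisfies the condition.
prices[dates == "2015-05-15"]   # → 12
```

Inside a grouped ‘mutate’, the single matched value gets recycled across all rows of each group, which is what produces the repeated baseline price in the next step.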

We can create an expression like below to get the price for each company.

mutate(base_price = Adjusted[date == "2015-05-15"])

And this will give you a chart like this.

Each company now has its own baseline price repeated throughout the time period. Once we have this baseline, all we need to do is calculate the difference between each stock price and the baseline price for each company, like below.

mutate(base_price = Adjusted[date == "2015-05-15"],
       growth_from_base = Adjusted - base_price,
       growth_percent_from_base = growth_from_base / base_price * 100)

And once we assign this newly created ‘growth_percent_from_base’ column to the Y-Axis, we get a chart like below.

4. 50-Day Moving / Rolling Average

Now, let’s say we want to calculate a 50-day moving average of the adjusted stock prices so that we can see the trend of the price changes better.

We can do this with one of the ‘rolling’ (or moving) functions, ‘roll_mean’, from the ‘RcppRoll’ package.

This ‘RcppRoll’ package from Kevin Ushey is actually super amazing and deserves a lot more credit and attention. Not only does it give a clear and consistent interface to all the rolling (moving) functions, which start with ‘roll_’, but they are also super fast thanks to their implementation in the C++ layer.

I’d highly recommend taking a look at the original reference documentation provided by Kevin. We also have a section for these functions in our documentation, as usual.

Anyway, we can use ‘roll_mean’ function like below.

mutate(moving_average = roll_mean(Adjusted, 50, align="right", fill=0))

I’m calculating a 50-day moving average and setting the ‘align’ argument to “right” so that each average is calculated over the previous 50 days rather than the next 50 days.
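The effect of the ‘align’ argument is easy to see on a short toy vector:

```r
library(RcppRoll)

x <- c(1, 2, 3, 4, 5)

# 3-point moving average, right-aligned: each value averages the
# current point and the 2 before it; incomplete windows become 0.
roll_mean(x, 3, align = "right", fill = 0)
# → 0 0 2 3 4
```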

After running the command and switching to this newly created column ‘moving_average’ for Y-Axis, we can see the chart like below.

Note that I have switched the aggregation level to ‘Day’ for the X-Axis. Also, you can see that the data shows 0 for the first 49 days, since I set the ‘align’ argument to ‘right’ and ‘fill’ to 0, so there is no complete 50-day window until day 50.

When you compare it to the chart below with the original values, you can see that the lines are much smoother with the moving average.

5. Find Abnormal Changes by Day

This is the last one, but it’s pretty cool. ;)

Once in a while, stock prices go up or down a lot based on news about the companies. Let’s say we want to find which stocks had such events (abnormal price change points) on which days over the last two years.

First, we can get the daily price changes by using another window function, ‘lag’, which returns the value from the previous row within a given group. In this case, the grouping is still set to ‘symbol’.

So the ‘lag’ function returns the previous value within each company (symbol) group.

mutate(diff = Adjusted - lag(Adjusted))

Now, based on this ‘diff’ value, we can calculate the percent change like below.

mutate(diff = Adjusted - lag(Adjusted), percent_diff = diff / lag(Adjusted) * 100)

Note that, again, I’m using the ‘diff’ column in the second expression even though it was only just created by the first expression of the same ‘mutate’ command.
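Here is the same two-step calculation on a tiny made-up series (the prices are illustrative):

```r
library(dplyr)

daily <- tibble::tibble(
  symbol   = c("A", "A", "A"),
  Adjusted = c(100, 110, 99)
)

changes <- daily %>%
  group_by(symbol) %>%
  mutate(diff = Adjusted - lag(Adjusted),
         percent_diff = diff / lag(Adjusted) * 100)

# diff:         NA  10  -11
# percent_diff: NA  10  -10
```

The first row has no previous value, so ‘lag’ returns NA there.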

We can visualize this with Line chart like below.

As you can see, there are some days when certain companies’ stock prices moved dramatically compared to the others. We can filter for only the price changes beyond a threshold, say more than 10% up or down, by using the ‘between’ function from the dplyr package inside a ‘filter’ command like below.

filter(!between(percent_diff, -10, 10))

Note that the exclamation mark ‘!’ negates the condition that follows it. This is equivalent to the following.

filter(percent_diff > 10 | percent_diff < -10)

Note that the vertical line ‘|’ means OR in R.
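Both forms can be checked on a small vector of toy percent changes:

```r
library(dplyr)

percent_diff <- c(-12, -3, 0, 8, 15)

percent_diff[!between(percent_diff, -10, 10)]          # → -12 15
percent_diff[percent_diff > 10 | percent_diff < -10]   # → -12 15
```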

After changing the chart type to Bar, we can see bars only for the days when some of the companies’ stock prices went up or down beyond the threshold values.

It looks like Twitter is the one that tends to go up or down dramatically more often than the others.

Your Turn!

I have shared this data, with all the steps, as an EDF (Exploratory Data Format) file so you can download it, import it into Exploratory Desktop, and follow how it is done step by step.

If you don’t have Exploratory Desktop, you can sign up on the website and download it. It’s free! ;)

If you’d like to reproduce this in a standalone R environment, here is the script.

# Custom R function as Data.
Stocks.func <- function(){
  require(quantmod)
  google <- rownames_to_column(data.frame(getSymbols("GOOG", auto.assign = FALSE)))
  facebook <- rownames_to_column(data.frame(getSymbols("FB", auto.assign = FALSE)))
  twitter <- rownames_to_column(data.frame(getSymbols("TWTR", auto.assign = FALSE)))
  amazon <- rownames_to_column(data.frame(getSymbols("AMZN", auto.assign = FALSE)))
  msft <- rownames_to_column(data.frame(getSymbols("MSFT", auto.assign = FALSE)))
  apple <- rownames_to_column(data.frame(getSymbols("AAPL", auto.assign = FALSE)))
  bind_rows(google, facebook, amazon, apple, twitter, msft)
}

# Set libPaths.
.libPaths("/Users/kannishida/.exploratory/R/3.3")

# Load required packages.
library(quantmod)
library(lubridate)
library(tidyr)
library(tibble)   # for rownames_to_column
library(stringr)
library(RcppRoll)
library(dplyr)
library(exploratory)

# Data Analysis Steps
Stocks.func() %>%
  exploratory::clean_data_frame() %>%
  gather(measure, value, -rowname, na.rm = TRUE) %>%
  separate(measure, into = c("symbol", "measure"), sep = "\\.") %>%
  spread(measure, value) %>%
  rename(date = rowname) %>%
  mutate(date = ymd(date)) %>%
  filter(date >= today() - years(2)) %>%
  group_by(symbol) %>%
  arrange(date) %>%
  mutate(growth = Adjusted - first(Adjusted), growth_percent = (Adjusted - first(Adjusted)) / first(Adjusted) * 100) %>%
  mutate(moving_average = roll_mean(Adjusted, 50, align = "right", fill = 0)) %>%
  mutate(diff = Adjusted - lag(Adjusted), percent_diff = diff / lag(Adjusted)) %>%
  filter(percent_diff > 0.1 | percent_diff < -0.1)

That’s it for today. If you like this post please click the green heart button at the bottom or leave your feedback. Thanks!