Data Analysis in RUby

At my job, I started work on a client project that was rather number intensive. We were going to have to perform repetitive calculations over a dataset. A colleague introduced me to the concept of data frames and the Ruby gem Daru.

What is Daru?

It’s a data analysis tool that lets us build tabular data sets which we can then manipulate and apply linear calculations. It’s also is a data visualization tool. It’s worth noting that Daru is quite similar in functionality to the pandas Python library.

You can do alot with Daru, but I’ll just be looking at a very small piece of it here. See Resources section below for a link to the docs.

Vectors and DataFrames

Vector

A Vector is a one dimensional set of data, like an array. When I was working with Daru, I liked to think of a Vector as a single column from a spreadsheet. In this example, I am mocking 7 day price history of some imaginary product. Note that the Vector can be named. Also, note the numeric index.

require 'daru' prices = Array . new ( 7 ) { rand ( 499 .. 599 ) } vector = Daru :: Vector . new prices , name: :price_cents #=> #<Daru::Vector(7)> # price_cents # 0 505 # 1 544 # 2 552 # 3 501 # 4 534 # 5 516 # 6 541

DataFrame

A DataFrame on the other hand is two dimensional, like a spreadsheet. Expanding on the example above, we can include a date column. We instantiate the DataFrame with a hash of arrays. Each key of the hash is a column name and the array contains the data. This is one of several ways to compose a DataFrame. Another way is to use Vectors instead of arrays - Vector indicies are lined up with each other.

require 'daru' require 'date' prices = Array . new ( 7 ) { rand ( 499 .. 599 )} def dates dates = [] 7 . times { | index | dates << Date . new ( 2016 , 1 , index + 1 ) } dates end df = Daru :: DataFrame . new ( { date: dates , price_cents: prices }, name: :product_prices , order: [ :date , :price_cents ] ) #=> #<Daru::DataFrame: product_prices (7x2)> # date price_cent # 0 2016-01-01 501 # 1 2016-01-02 591 # 2 2016-01-03 543 # 3 2016-01-04 558 # 4 2016-01-05 528 # 5 2016-01-06 552 # 6 2016-01-07 558

So I have a DataFrame…Now what?

We can perform analysis on the data frame like finding the mean, counts, min, and max. Also we can find covariance and correlation bewtween Vectors. It’s also possible to perform SQL like queries against the data. There is also filtering and sorting…the list goes on and on. See the documentation for the details. Here, I’ll show a couple of examples that I found useful.

Add a rolling mean column

Here is an example of adding a column to the DataFrame that is a rolling mean calculation on the price column with a lookback of 7.

... df = Daru :: DataFrame . new ( { date: dates , price_cents: prices }, name: :product_prices , order: [ :date , :price_cents ] ) # Add a column that is a rolling mean of price with a lookback of 7. df [ :r_mean ] = df [ :price_cents ]. rolling_mean ( 7 ) p df #=> #<Daru::DataFrame: product_prices (14x3)> # date price_cent r_mean # 0 2016-01-01 573 nil # 1 2016-01-02 587 nil # 2 2016-01-03 511 nil # 3 2016-01-04 501 nil # 4 2016-01-05 548 nil # 5 2016-01-06 562 nil # 6 2016-01-07 567 549.857142 # 7 2016-01-08 503 539.857142 # 8 2016-01-09 503 527.857142 # 9 2016-01-10 565 535.571428 # 10 2016-01-11 580 546.857142 # 11 2016-01-12 567 549.571428 # 12 2016-01-13 540 546.428571 # 13 2016-01-14 514 538.857142

Do some arithmetic.

What if we want to calculate the price difference with a lookback of 7. Here is one way we could solve it with the lag method.

... # First we could add a column that is the price_cents column shifted by 7. df [ :prev_week_price ] = df [ :price_cents ]. lag ( 7 ) p df #=> #<Daru::DataFrame: product_prices (14x3)> # date price_cent prev_week_ # 0 2016-01-01 515 nil # 1 2016-01-02 567 nil # 2 2016-01-03 528 nil # 3 2016-01-04 593 nil # 4 2016-01-05 599 nil # 5 2016-01-06 563 nil # 6 2016-01-07 520 nil # 7 2016-01-08 579 515 # 8 2016-01-09 551 567 # 9 2016-01-10 575 528 # 10 2016-01-11 541 593 # 11 2016-01-12 529 599 # 12 2016-01-13 594 563 # 13 2016-01-14 573 520 # Now we can calculate the difference between values for each index. df [ :weekly_change ] = df [ :price_cents ] - df [ :prev_week_price ] p df #=> #<Daru::DataFrame: product_prices (14x4)> # date price_cent prev_week_ weekly_cha # 0 2016-01-01 590 nil nil # 1 2016-01-02 515 nil nil # 2 2016-01-03 507 nil nil # 3 2016-01-04 591 nil nil # 4 2016-01-05 556 nil nil # 5 2016-01-06 534 nil nil # 6 2016-01-07 524 nil nil # 7 2016-01-08 562 590 -28 # 8 2016-01-09 544 515 29 # 9 2016-01-10 553 507 46 # 10 2016-01-11 577 591 -14 # 11 2016-01-12 530 556 -26 # 12 2016-01-13 502 534 -32 # 13 2016-01-14 511 524 -13

Join DataFrames

Let’s say we want to compare the prices of two products. We could join two DateFrames together à la SQL.

... # Product 'a' prices df_a = Daru :: DataFrame . new ( { date: dates , price_a: prices_a }, name: :product_a_prices , ) # Product 'b' prices df_b = Daru :: DataFrame . new ( { date: dates , price_b: prices_b }, name: :product_b_prices , ) # We can join both DataFrames on the :date Vector df_joins = df_a . join ( df_b , on: [ :date ], how: :left ) #=> #<Daru::DataFrame(14x3)> # price_a date price_b # 0 541 2016-01-01 566 # 1 570 2016-01-02 578 # 2 561 2016-01-03 541 # 3 537 2016-01-04 535 # 4 594 2016-01-05 570 # 5 541 2016-01-06 586 # 6 536 2016-01-07 539 # 7 586 2016-01-08 586 # 8 596 2016-01-09 510 # 9 572 2016-01-10 565 # 10 536 2016-01-11 535 # 11 536 2016-01-12 574 # 12 549 2016-01-13 505 # 13 597 2016-01-14 538

Conclusion

Daru is a powerful data analysis tool which I have barely scratched the surface. Some other things of note:

Creating DataFrames from CSVs or Excel files

Grouping and aggregating data

Graceful handling of missing data (nils)

Pivot tables

Data visulization

There is much more to this library that what I have shown, so I encourage the reader to explore more. I have provided some links below to get started.

Resources