New features and improvements in Deedle v1.0

As Howard Mansell already announced on the BlueMountain Tech blog, we have officially released the "1.0" version of Deedle. In case you have not heard of Deedle yet, it is a .NET library for interactive data analysis and exploration. Deedle works great with both C# and F#. It provides two main data structures: series for working with data and time series, and frame for working with collections of series (think CSV files, data tables, etc.).

The great thing about Deedle is that it is becoming a foundational library that makes it possible to integrate a wide range of diverse data-science components. For example, the R type provider works well with Deedle and so does F# Charting. We've also been working on integrating all of these into a single package called FsLab, but more about that next time!

In this blog post, I'll have a quick look at a couple of new features in Deedle (and the corresponding R type provider release). Howard's announcement has a more detailed list, but I just want to give a couple of examples and briefly comment on the performance improvements we made.

What's new in Deedle?

Perhaps the most visible difference in the new version is that many of the functions are renamed. We thought that before v1.0, we had a unique chance to get the naming right, so we did a lot of renamings to make sure that everything is consistent. For example, some functions used series and some column, some used sort and others order and so on. This should now be cleaned up. Similarly, we fixed a number of mismatches between Series and Frame modules.

Additions to Deedle API

Aside from renaming, we also added a couple of useful functions. For example, the homepage sample compares survival rates for different passenger classes. This can now be done even more easily using PivotTable:

    #load "Deedle.1.0.0/Deedle.fsx"
    open Deedle

    let titanic = Frame.ReadCsv("../data/titanic.csv")

    // Pivot operation using "Sex" as row
    // and "Survived" as a new column
    titanic
    |> Frame.pivotTable
        (fun _ row -> row.GetAs<string>("Sex"))
        (fun _ row -> row.GetAs<bool>("Survived"))
        Frame.countRows

    // The same operation using method notation
    titanic.PivotTable<string, bool, _>("Sex", "Survived", Frame.countRows)

The operation groups the rows according to the two keys and then performs aggregation using the specified function (here Frame.countRows). This is a common operation and so we wanted to make it as simple as possible. We also continue to expose operations both as F# functions in modules and as C#-friendly methods.
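To make the grouping and aggregation concrete, here is a minimal plain-F# sketch of what a count-based pivot computes. It does not use Deedle at all; the Passenger record, the sample rows and the cell helper are made up for illustration and are not taken from the Titanic data set.

```fsharp
// Hypothetical record type and rows, used only to illustrate the idea.
type Passenger = { Sex : string; Survived : bool }

let passengers =
    [ { Sex = "male";   Survived = false }
      { Sex = "male";   Survived = true  }
      { Sex = "female"; Survived = true  }
      { Sex = "female"; Survived = true  }
      { Sex = "male";   Survived = false } ]

// Group rows by the (row key, column key) pair and aggregate each
// group by counting its rows, similar in spirit to Frame.countRows.
let pivot =
    passengers
    |> List.groupBy (fun p -> p.Sex, p.Survived)
    |> List.map (fun (key, rows) -> key, List.length rows)
    |> Map.ofList

// Look up one cell of the pivot table; missing cells count as zero.
let cell sex survived =
    pivot |> Map.tryFind (sex, survived) |> Option.defaultValue 0

printfn "male/false = %d" (cell "male" false)
printfn "female/true = %d" (cell "female" true)
```

Deedle returns the result as a frame with the row and column keys as its two indices, rather than a map keyed by a pair as in this sketch.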

Another area where we made a lot of improvements is statistics:

    open System

    let msft = Frame.ReadCsv<DateTime>("../data/msft.csv", "Date")
    msft?Open |> Stats.movingVariance 100
    msft?Open |> Stats.expandingMean
    msft?Open |> Stats.kurt

The first improvement is that you can now specify the key column when loading data from a CSV file (again, this is very common). The same feature is available when loading data from a sequence of .NET objects using Frame.ofRows.

The next new thing is the Stats module. This is the new place for all functions related to statistics and numerical computations. We found that adding more functions to the Series and Frame modules was a bit confusing, so we moved all statistical functions into one place. This is even more important now that we added more functions (kurtosis, skewness, variance) and more ways to calculate them (moving and expanding windows). For more information, see the frame and series statistics page.
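If the difference between moving and expanding windows is not obvious, the following plain-F# sketch shows the idea on a small list. It is not how Deedle implements these functions; the names expandingMean and movingVariance are reused here only for readability, and the sketch assumes a sample variance (n - 1 denominator) and simply skips incomplete windows, which may differ from the library's behaviour.

```fsharp
let values = [ 1.0; 2.0; 3.0; 4.0; 5.0 ]

// Expanding mean: the i-th output is the mean of the first i+1 values.
let expandingMean (xs : float list) =
    xs
    |> List.scan (fun (sum, n) x -> sum + x, n + 1) (0.0, 0)
    |> List.tail                       // drop the initial (0.0, 0) state
    |> List.map (fun (sum, n) -> sum / float n)

// Moving variance: the sample variance of each full window of the
// given size (assumption: n - 1 denominator, incomplete windows skipped).
let movingVariance size (xs : float list) =
    xs
    |> List.windowed size
    |> List.map (fun w ->
        let mean = List.average w
        let sumSq = w |> List.sumBy (fun x -> (x - mean) ** 2.0)
        sumSq / float (size - 1))

printfn "%A" (expandingMean values)      // [1.0; 1.5; 2.0; 2.5; 3.0]
printfn "%A" (movingVariance 3 values)   // [1.0; 1.0; 1.0]
```

An expanding window grows from the start of the series, so the output has one value per input; a moving window has a fixed size and slides along the series.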

Improved documentation

Finally, one of the strong points of Deedle is that it has excellent documentation. This is now even more the case, because we polished the documentation automatically generated from Markdown comments in the source code. In particular, for the three core modules:

• Series module provides functions for working with individual data series and time-series values. This includes operations such as sampling, transformations, data access and more.

• Frame module provides functions that are similar to those in the Series module, but operate on entire data frames. You can transform, align and join frames, perform various re-indexing operations etc.

• Stats module implements standard statistical functions (mean, variance, kurtosis, skewness, etc.) over series, moving windows, expanding windows and a lot more. The module contains functions for both series and frames.

What's new in the R provider?

Together with a new release of Deedle, we also updated the R type provider. There are a couple of improvements that make it work a lot better:

• The installation from NuGet no longer relies on a PowerShell installation script, so it can work on Mono and when using the "Restore Packages" feature.

• The type provider communicates with R via a separate process, so it is more stable and it will also let us call the 64-bit version of R.

These are technical, but very important improvements. However, we also added one nice new feature that makes it even easier to mix R and F#!

RData type provider

In R, you can save workspaces (environments) into *.rdata files. This is useful if you want to archive results of some interactive analysis done in the R environment. But, wouldn't it be nice if you could do some data analysis in R and then save the data to a file and load it easily from F# in a type-safe way?

This is exactly what you get with the RData type provider! Let's say that I have a cars.rdata file containing the mtcars data set (saved under the name cars) together with a list mpg and a value mpgMean. I can write:

    #load "RProvider.1.0.9/RProvider.fsx"
    open RProvider

    let file = new RData<"../data/cars.rdata">()

    // Calculate mean in R and in F#
    let mean1 = file.mpg |> Array.average
    let mean2 = file.mpgMean.[0]

    // Average mpg based on cylinder count
    file.cars
    |> Frame.groupRowsByInt "cyl"
    |> Frame.getCol "mpg"
    |> Stats.levelMean fst
    |> Series.observations

If you look at the types, you'll see that file.mpg is of type float[] and file.cars is of type Frame<string, string>. The R type provider uses the installed plugins (like the Deedle plugin) to find the most appropriate F# type for exposing the data, and so the R data frame cars is automatically exposed as a Deedle frame.

This lets us quickly group the values by "cyl" (number of cylinders) and then calculate average miles per gallon "mpg" for each of the groups. Using F# Charting, the result looks like this:
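The pipeline above groups rows by a key and then averages one column per group. As a rough plain-F# analogy (not Deedle code), the same computation on a list of tuples might look as follows. The rows are hypothetical values chosen for easy arithmetic; they are not the mtcars data.

```fsharp
// Hypothetical (cylinders, miles per gallon) rows for illustration.
let cars =
    [ 4, 30.0; 4, 26.0; 6, 20.0; 6, 18.0; 8, 15.0; 8, 13.0 ]

// Group by the first component (cylinder count) and average the
// second component (mpg) within each group.
let meanMpgByCyl =
    cars
    |> List.groupBy fst
    |> List.map (fun (cyl, rows) -> cyl, rows |> List.averageBy snd)
    |> List.sortBy fst

printfn "%A" meanMpgByCyl   // [(4, 28.0); (6, 19.0); (8, 14.0)]
```

In the Deedle version, Frame.groupRowsByInt produces a frame whose row keys are pairs of the group key and the original row key, which is why Stats.levelMean is given fst to pick the cylinder count as the level to aggregate over.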

Deedle performance improvements

In this release of Deedle, we spent some time on improving the performance. The first version was designed with performance in mind and the internals make it possible to implement operations efficiently (e.g. in F#, it is quite easy to write code so that the data is stored in contiguous memory blocks). However, there were a number of places where some Deedle function just used the "simplest stupid way to get things done".

This was nice, because it let us quickly build a sophisticated and easy to use API, but there were cases where things were just too slow. So, improving performance is an ongoing effort and if you find a use case where Deedle is slow, please submit an issue!

Measuring performance

To make sure we can monitor the performance, I created a fairly simple tool that lets us measure performance automatically. This is currently available in my branch. The tool is started via a FAKE script and it measures the performance of all tests in a specified file. The tests also serve as unit tests. For example:

    [<Test; PerfTest(Iterations = 10)>]
    let ``Merge 3 unordered 300k long series (repeating Merge)`` () =
      r1.Merge(r2).Merge(r3).KeyCount |> shouldEqual 900000

The PerfTest attribute specifies that the function is a performance test and it also lets us specify the number of iterations (so that we run quick tests repeatedly, but slow tests only a few times).
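For readers curious how such an attribute and a timing loop could fit together, here is a small self-contained sketch. It is not the actual tool: the PerfTestAttribute definition, the measure function and the sample test are all hypothetical stand-ins, and the real runner discovers tests automatically from a FAKE script rather than being called by hand.

```fsharp
open System
open System.Diagnostics

// Hypothetical stand-in for the attribute described above; the real
// definition in the performance tool may differ.
type PerfTestAttribute() =
    inherit Attribute()
    member val Iterations = 1 with get, set

// Run the action the requested number of times and return the
// total elapsed time in milliseconds.
let measure (iterations : int) (action : unit -> unit) =
    let sw = Stopwatch.StartNew()
    for _ in 1 .. iterations do
        action ()
    sw.Stop()
    sw.ElapsedMilliseconds

// A made-up quick test that is worth repeating several times.
[<PerfTest(Iterations = 10)>]
let ``Sum one million floats`` () =
    Array.init 1000000 float |> Array.sum |> ignore

let elapsed = measure 10 ``Sum one million floats``
printfn "Total: %d ms" elapsed
```

Reporting the total over all iterations, as this sketch does, keeps very fast tests from being dominated by timer resolution.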

Absolute performance

I did two simple analyses of the performance. The first chart compares the new version of Deedle with the previous version available on NuGet:

• v0.9.12 (November 2013)

• v1.0.0 (May 2014)

The numbers represent the total number of milliseconds needed to run the test. Note that the X axis is limited to 10 seconds, but some of the tests actually take longer using the old version. Also, some tests only have a value when using the new version - this is because they use functions that are new in v1.0.

A couple of points worth mentioning:

• Some of the notable improvements are when merging series - this also applies to joining of frames (e.g. when applying numerical operations). We also added an overload of Merge on frames that can merge multiple series at once, which is significantly faster (and lets you merge e.g. 1000 frames, which was previously too slow).

• There are a number of improvements in Resample operations. Again, this is just an example of a more general speedup (that also affects windowing and chunking functions).
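The difference between repeated merging and merging several inputs in one call can be sketched as follows. This example uses series rather than frames, with tiny stand-ins for the 300k-key series from the performance test. The `#r "nuget: Deedle"` reference is an assumption about your setup (a recent .NET SDK with F# Interactive); with the packages used in this post you would load Deedle.fsx instead.

```fsharp
#r "nuget: Deedle"   // assumption: F# Interactive with NuGet support
open Deedle

// Three small series with disjoint keys.
let s1 = series [ 1, 1.0; 2, 2.0 ]
let s2 = series [ 3, 3.0; 4, 4.0 ]
let s3 = series [ 5, 5.0; 6, 6.0 ]

// Repeated pairwise merging builds an intermediate series each time.
let pairwise = s1.Merge(s2).Merge(s3)

// Passing all the series in one call lets the library merge them at once.
let allAtOnce = s1.Merge(s2, s3)

printfn "%d %d" pairwise.KeyCount allAtOnce.KeyCount
```

Both results contain the same six keys; the benefit of the single call only shows up with many or large inputs.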

Relative performance

In the previous chart, it is a bit difficult to see where the greatest performance improvement is. In the following chart, the tests are scaled so that the performance using the original version (0.9.12) is used as 100% and the performance using the new version is shown as a percentage (so cutting 10sec down to 5sec shows as 50%).
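The scaling is simple arithmetic. As a small sketch (the function names are mine, not part of the measurement tool):

```fsharp
// Relative time of the new version, with the old version as 100%.
let relativePercent (oldMs : float) (newMs : float) =
    newMs / oldMs * 100.0

// Speedup factor, i.e. how many times faster the new version is.
let speedup (oldMs : float) (newMs : float) =
    oldMs / newMs

printfn "%.0f%%" (relativePercent 10000.0 5000.0)   // prints "50%"
printfn "%.1fx" (speedup 10000.0 5000.0)            // prints "2.0x"
```

Read this way, an operation that is 6x faster would show up as roughly 17% on the relative chart.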

Again, you can see a number of interesting things:

• The biggest speedup is on "Accessing float series via object series". This is the case when you access a column on a frame using df.Columns (which returns a series of ObjectSeries<'K> values). Because we do not know the type of individual columns, we return them as series containing obj values. In the new version, this does not actually box the values and so converting the series back to Series<'K, float> is essentially a no-op.

• We also did some work on improving grouping (and related) operations, so, for example, the homepage sample is now about twice as fast. There is still (a lot of) room for improvement, but as you can see, we're working hard on this!

• The joining and merging operations are about 6x faster, but for Merge this is even more significant when you're merging multiple frames.

The tests that I included here are by no means comprehensive. They simply represent a couple of test cases that I was working on. However, with the performance measurements in place, we should be able to use this more and more often! So, if you have an interesting use case, submit a pull request adding a performance test!

Summary

The "1.0" release of Deedle is an important milestone. Although Deedle has been around since November (and it has been used internally by BlueMountain), the "1.0" release means that the library is becoming more stable and ready for others to adopt.

Of course, there is always room for improvement. There are operations that could be faster (please report them!), there are functions that should be added (please suggest them!) and there are likely a few remaining bugs. I marked some issues as up-for-grabs in case you wanted to contribute directly.

Another important thing about Deedle is that it is a foundational component around which we can build an awesome .NET data science stack. If you're interested, register at www.fslab.org and follow this blog for more information.

There are many people who contributed to Deedle (and R provider), but the projects wouldn't exist without Howard Mansell and Adam Klein at BlueMountain. A lot of the R provider work has been done by David Charboneau. Thanks!
