What is column-oriented storage?

Row-oriented storage and column-oriented storage are two major ways of laying out data in memory. In Rust, there is a simple way to think of this: row-oriented storage is like an array of structs, whereas column-oriented storage is like a struct of arrays. It’s easy to use row-oriented storage in Rust, so this post is going to explore column-oriented storage.

To make this post a bit more concrete, I’m going to work with this example struct:

struct Planet { mass_kg : f64 , radius_m : f64 , habitable : bool , }

When I create a Vec<Planet> , each Planet will be stored contiguously in memory; this is row-oriented storage. For most purposes, this is nice. For example, appending another Planet into my Vec usually only requires writing to a single region of memory.

In column-oriented storage, all values for each column will be stored contiguously. For Planet , this might look like:

struct PlanetFrame { mass_kg : Vec < f64 > , radius_m : Vec < f64 > , habitable : Vec < bool > , }

This can be useful for a few reasons:

Many analytics queries only require looking at a few fields. By accessing only the fields that we need, we reduce IO.

We should be able to leverage SIMD and CPU pipelining more easily.

We can store certain types in more efficient ways. For example, booleans can be stored using run-length encoding or a bit field.

Column-oriented storage is most prevalent in databases like Redshift and BigQuery and “big data” file formats like ORC and Parquet, but it can also be useful for in-memory data. Daniel Abadi has a great piece on this: Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?

The problem that I’m investigating here is: how can I take a struct like Planet and automatically transform it into a column-oriented version like PlanetFrame ?

What is frunk?

The frunk crate is a “generic functional programming toolbelt for Rust.” Its flagship data type is HList , which is short for “heterogenous list.” This is a kind of list where each element can have a different type, all of which are known at compile time. Here’s some example frunk code to give you a feel of how HList works:

// Create a new HList let h = hlist![1, "hi"]; assert_eq!(h.len(), 2); // Destructure the HList let hlist_pat!(a, b) = h; assert_eq!(a, 1); assert_eq!(b, "hi");

One thing to note: all type-checking with HLists happens at compile-time, so the elements inside the list don’t need to be boxed. The author has written a great introduction if you’d like to learn more about HLists in Rust.

The reason that I chose frunk for this problem is because it allows me to implement column-oriented storage without writing any macros.

The problem at hand

For my proof-of-concept column-oriented Vec storage, I chose to satisfy a minimal set of requirements:

Retrieve a single row

Retrieve a single column

Iterate over all rows

To meet these requirements, I built a column-oriented data structure that is an HList of columns. Briefly, I made Row and Frame traits, and implemented them for frunk::HNil and frunk::HCons .

Below, I’ll share some more details on my implementation.

Selecting a column

hlist already has a way to select columns using the hlist_pat macro. It looks like this:

let hlist_pat! [ _mass_kg , radius_m , ... ] = & planets ; assert_eq! ( radius_m , & [ 6.37814e6 , 3.3972e6 ]);

This is great for collections with only a few columns. Any type errors will be caught at compile time, which is cool.

I think this macro could be burdensome for collections with many columns. Perhaps it could be augmented to look something like:

let hlist_pat! [ SKIP_8 , flight_number , ... , departure_date , SKIP_5 ] = & flights ;

Selecting a row

Naively, I would want the method signature for selecting a row to be something like fn row(&self) -> &Planet . Unfortunately, no Planet actually exists in memory, so returning a reference is not straightforward at all. Instead, I pursued a method that creates and returns a new Planet .

Overall, this implementation wasn’t too difficult. One note is that all fields must implement Clone , because we need to be able to copy it from the column-oriented storage into the newly created Planet object.

To make this more efficient, you could make a function like fn write_row(&self, write_into: &mut Planet) to write a row from the frame into an existing Planet . This way, you could get the value of a row without needing to allocate new memory.

Iterating over rows

I wanted to implement iterators using closures. This could theoretically be done by giving Frame a method fn iter(&self) -> impl Iterator<Item=Self::MyRow> . Unfortunately, a trait can’t currently have a method with impl Trait in return position. To solve this, I made HConsIterator and HNilIterator structs. This is a fairly common pattern for transforming objects like Iterator and Future .

My implementation is not too efficient because it does a lot of allocations. Another option would be to use an iterator of references. This is not possible using Rust’s built-in Iterator trait, so you need a third-party lib like streaming_iterator.

Example code

Here’s the final example code that I came up with. You can see the whole project on GitHub.

# [ macro_use ] extern crate frunk ; extern crate frunk_core ; use frunk :: indices :: Here ; use frunk :: prelude :: * ; use frunk :: Generic ; use frunk_column :: * ; use std :: iter :: FromIterator ; fn main () { #[derive(Clone, Copy, Debug, Generic, PartialEq)] struct Planet { mass_kg : f64 , radius_m : f64 , habitable : bool , } let earth = Planet { mass_kg : 5.976e+24 , radius_m : 6.37814e6 , habitable : true , }; let mars = Planet { mass_kg : 6.421e+23 , radius_m : 3.3972e6 , habitable : false , }; let mut planets = < Planet as Generic > :: Repr :: new_frame (); planets .push ( frunk :: into_generic ( earth )); planets .push ( frunk :: into_generic ( mars )); let hlist_pat! [ _mass_kg , radius_m , ... ] = & planets ; assert_eq! ( radius_m , & [ 6.37814e6 , 3.3972e6 ]); assert_eq! ( planets .row ( 1 ), Some ( frunk :: into_generic ( mars ))); assert_eq! ( Vec :: from_iter ( planets .into_iter () .map :: < Planet , _ > ( frunk :: from_generic )), vec! [ earth , mars ] ); }

Reflections

Type ergonomics

To transform my struct into a frunk HList , I used the Generic derived trait from frunk . This was a bit inconvenient because I needed to add a lot of boilerplate in many situations. For example, creating a new frame:

let mut planets = < Planet as Generic > :: Repr :: new_frame (); planets .push ( frunk :: into_generic ( earth )); planets .push ( frunk :: into_generic ( mars ));

Or, collecting the rows into a Vec :

Vec :: from_iter ( planets .into_iter () .map :: < Planet , _ > ( frunk :: from_generic ))

I think it would be possible to smooth some of this over with convenience methods.

Trait ergonomics

One limitation is that I can’t implement any third-party traits (e.g. IntoIterator ) for Frame . This is because the structs for which I’ve implemented Frame are part of the frunk crate, so implementing IntoIterator would be an orphan trait. I worked around this by giving Frame a method with a similar signature: fn into_iter(self) -> Self::Iterator .

Conclusion

I was impressed that frunk let me implement column-oriented storage so quickly and easily. I didn’t have to write a single macro! However, I’d also like to protoype other solutions to solve the problems that I highlighted above.

I would love to hear any suggestions for further directions to explore for columnar data storage in Rust. If you have any, please contact me via the Reddit post, rust-lang.org forum thread, or via email.

Thanks to Ben Striegel, Anton Lazarev, and Brandon Maister for comments.

Appendix: Further Reading