$\begingroup$

A column-oriented database (=columnar data-store) stores the data of a table column by column on the disk, while a row-oriented database stores the data of a table row by row.

There are two main advantages of using a column-oriented database in comparison with a row-oriented database. The first advantage relates to the amount of data one’s need to read in case we perform an operation on just a few features. Consider a simple query:

SELECT correlation(feature2, feature5) FROM records

A traditional executor would read the entire table (i.e. all the features):

Instead, using our column-based approach we just have to read the columns which are interested in:

The second advantage, which is also very important for large databases, is that column-based storage allows better compression, since the data in one specific column is indeed homogeneous than across all the columns.

The main drawback of a column-oriented approach is that manipulating (lookup, update or delete) an entire given row is inefficient. However the situation should occur rarely in databases for analytics (“warehousing”), which means most operations are read-only, rarely read many attributes in the same table and writes are only appends.

Some RDMS offer a column-oriented storage engine option. For example, PostgreSQL has natively no option to store tables in a column-based fashion, but Greenplum has created a closed-source one (DBMS2, 2009). Interestingly, Greenplum is also behind the open-source library for scalable in-database analytics, MADlib (Hellerstein et al., 2012), which is no coincidence. More recently, CitusDB, a startup working on high speed, analytic database, released their own open-source columnar store extension for PostgreSQL, CSTORE (Miller, 2014). Google’s system for large scale machine learning Sibyl also uses column-oriented data format (Chandra et al., 2010). This trend reflects the growing interest around column-oriented storage for large-scale analytics. Stonebraker et al. (2005) further discuss the advantages of column-oriented DBMS.

Two concrete use cases: How are most datasets for large-scale machine learning stored?

(most of the answer comes from Appendix C of: BeatDB: An end-to-end approach to unveil saliencies from massive signal data sets. Franck Dernoncourt, S.M, thesis, MIT Dept of EECS)