First, an aside: the motivation behind this post was some recent research in sparse matrix-dense vector multiplication, and the lack of an up-to-date plain English introduction to various sparse matrix formats. I write this post in the hope that it can be useful to other CS folks who want to better understand how sparse matrices are actually stored and the trade-offs between different formats. If you’re just curious, feel free to read the first section and stop there. If you want a wide-but-not-deep survey of modern sparse matrix formats, the full post will be useful. If you want a wide-but-also-deep survey, stay tuned for future posts on performance characterization of different formats.

All non-trivial software applications operate on collections of objects. These objects may represent customers, particles, anything with one or more attributes. The importance of storing and analysing large collections of objects has spawned fundamental fields in computer science like databases, data analytics, distributed systems, and more.

Objects are to software development as matrices are to applied math. Matrices are a unifying data representation. Matrices store a collection of items (one per row), with attributes for each item (across columns). They can be used to store relations between items, just as objects can store object references. The importance of matrices in real-world scientific and engineering problems cannot be overstated: they are the key abstraction applied mathematicians use to express their problems across many domains.

Matrices can be divided into two classifications: dense and sparse. The “sparsity” of a matrix refers to how many non-zero values it stores, relative to the total number of cells. There’s no hard-and-fast threshold for when a matrix flips between being dense or sparse. The simplest approach we can take is to say a matrix is sparse when representing it in a storage format specialized for sparsity gets us some savings, in either space or time.

Let’s start by looking at how dense matrices are stored. If we’re talking in C, we can easily represent a dense matrix as an array:

double matrix[10][10];

Referencing matrix[0][0] gets us the leftmost element in the topmost row in the matrix.

If the size of a matrix is not fixed, we can use dynamic memory allocation to get the same effect by flattening the array:

double *matrix = (double *)malloc(nrows * ncols * sizeof(double));

(often you’ll see the number of rows in a matrix denoted by “M” and the number of columns denoted by “N”. I use nrows and ncols here to keep things straightforward for people not familiar with this convention).
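To make the row-major layout concrete, here's a small sketch of how a cell (r, c) maps onto the flattened array. The helper names dense_get and dense_set are my own, purely for illustration:

```c
#include <stddef.h>

/* Row-major layout: cell (r, c) of an nrows x ncols matrix lives at
 * offset r * ncols + c in the flattened array. Helper names here are
 * illustrative, not from any library. */
double dense_get(const double *m, size_t ncols, size_t r, size_t c) {
    return m[r * ncols + c];
}

void dense_set(double *m, size_t ncols, size_t r, size_t c, double v) {
    m[r * ncols + c] = v;
}
```

With this layout, matrix[0][0] from the fixed-size version corresponds to dense_get(matrix, ncols, 0, 0).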

So storing this dense matrix costs us nrows * ncols * sizeof(double) bytes. Say we have a square matrix with 1,000,000 elements on each side, and sizeof(double) is 8 bytes. Then we’re paying 8TB to store this matrix in memory. With machines today topping out at around 1TB of memory it isn’t inconceivable that in the near future storing an 8TB matrix in memory will be possible. But for most of us mere mortals with a few GB of memory in our laptops, this dense representation really isn’t viable for large matrices.

Fortunately, many real-world matrices are sparse, i.e. contain many more zeros than non-zeros. Using specialized formats for sparse matrices can gain you both significant time and space savings over naively storing them in a dense format.

The simplest sparse format is the Coordinate format (COO). In COO you represent a matrix with three arrays: a rows array, a columns array, and a values array. For every non-zero value in the original sparse matrix, there is an entry at index i in the rows array, columns array, and values array that stores the row, column, and value of that non-zero item. Suppose you have the matrix below:

            columns
           0     1
row 0 [  1.0   2.0  ]
row 1 [  0.0   4.0  ]

Storing this matrix in COO would produce the following three arrays:

unsigned rows[3] = { 0, 0, 1 };

unsigned columns[3] = { 0, 1, 1 };

double values[3] = { 1.0, 2.0, 4.0 };

indicating that at cell (0, 0) the value 1.0 is stored, at cell (0, 1) the value 2.0 is stored, and at cell (1, 1) the value 4.0 is stored. Note that we store no explicit information on (1, 0) and so its value is implicitly set to 0.0.
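As a sketch of how those three arrays get built, here is one way to scan a dense row-major matrix and emit a COO representation. The function name and calling convention are my own, not part of any standard API:

```c
#include <stddef.h>

/* Sketch: scan a dense row-major matrix and append every non-zero cell
 * to the COO arrays, in row-then-column order. Returns the number of
 * non-zeros written (nnz). The caller must size rows/columns/values to
 * hold up to nrows * ncols entries. */
size_t dense_to_coo(const double *dense, size_t nrows, size_t ncols,
                    unsigned *rows, unsigned *columns, double *values) {
    size_t nnz = 0;
    for (size_t r = 0; r < nrows; r++) {
        for (size_t c = 0; c < ncols; c++) {
            double v = dense[r * ncols + c];
            if (v != 0.0) {
                rows[nnz]    = (unsigned)r;
                columns[nnz] = (unsigned)c;
                values[nnz]  = v;
                nnz++;
            }
        }
    }
    return nnz;
}
```

Running this on the 2x2 example above reproduces exactly the three arrays shown.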

Now say I asked for the value stored at cell (1, 1) in the original matrix. Finding it is a simple matter of iterating over the entries in rows and columns, checking at each if rows[i] == 1 and columns[i] == 1. If they both match the target coordinates, the result is stored in values[i]. If no exact match is found, the result must be 0.0.
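The linear scan described above can be sketched as follows (the function name coo_get is my own invention):

```c
#include <stddef.h>

/* Sketch of the linear-scan lookup: walk all nnz entries and return
 * values[i] on an exact (row, column) match. This costs O(nnz) per
 * lookup in the worst case. */
double coo_get(const unsigned *rows, const unsigned *columns,
               const double *values, size_t nnz,
               unsigned r, unsigned c) {
    for (size_t i = 0; i < nnz; i++) {
        if (rows[i] == r && columns[i] == c)
            return values[i];
    }
    return 0.0; /* no explicit entry stored, so implicitly zero */
}
```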

Let’s go back to our massive 1,000,000 x 1,000,000 matrix and guess that 5% of all cells have a non-zero value in them, meaning 50,000,000,000 non-zero values in total. Using COO (and assuming sizeof(unsigned) == 4), storing this massive matrix then costs us:

(sizeof(unsigned) + sizeof(unsigned) + sizeof(double)) * 5E10
  => 16 bytes * 5E10
  => 8E11 bytes = 800 GB

That’s a pretty significant savings over the 8TB we had to pay with the dense format! Note that the COO format’s size doesn’t shrink one-for-one with sparsity: because each COO entry costs twice as much as a dense entry, 5% non-zeroes only brought the matrix size down to 10% of the dense format’s size. Still, at least now we’re only two orders of magnitude above what a modern laptop can store!

We can even derive the tipping point where using the COO format saves space relative to the dense format, and use that as a working definition for when a matrix is “sparse” vs. “dense”. We need to find when storing 16 bytes for each non-zero value is cheaper than storing 8 bytes for every cell, or:

16 * nnz < 8 * nrows * ncols

where nnz is a common shorthand for the total “number of non-zeroes” in a matrix. If we also define a value sparsity as the fraction of cells that are non-zero, we can derive:

16 * (sparsity * nrows * ncols) < 8 * nrows * ncols

=> sparsity * nrows * ncols < 0.5 * nrows * ncols

=> sparsity < 0.5

Whenever the sparsity of a matrix is below 50% (i.e. less than half of its values are non-zero), we can get space savings from storing it in the sparse COO format, relative to the dense format. Note that while this property does not vary with nrows or ncols, it does vary with the # of bytes needed to store a value in each format. This relation could be more abstractly defined as:

           # of bytes per value in dense
sparsity < -----------------------------
            # of bytes per value in COO

So COO is great because it allows us to save space on some classes of sparse matrices, relative to the dense format. But what’s wrong with COO? Quite a lot, as it turns out. First, the space to store each element has doubled. Sure, we get space savings on some matrices, but those 16 bytes per non-zero still seem wasteful, especially since all of the extra space holds highly repetitive data: every value in the rows array is constrained to the range [0, nrows), and a similar property holds for the columns array. Second, lookups are slow. Say I asked you to give me the value stored at cell (980, 1020). If we don’t assume any sorting of the rows or columns arrays, that single lookup requires looking at every single non-zero value. In the dense format, you would just jump to offset 980 * ncols + 1020 in your array and be done with it.