Of all the backup companies on the market, few have documented their work and research as thoroughly as Backblaze. The company has previously made headlines for open-sourcing both the hardware design of its Storage Pods and its hard drive reliability data (the latter early this year). Now, Backblaze is opening up another facet of its operation: the implementation of its Reed-Solomon error-correcting codes.

Reed-what?

Reed-Solomon error-correcting codes are a critically important underpinning of computing that you’ve probably never heard of. First created by Irving S. Reed and Gustave Solomon in 1960, they form the basis of error correction in a huge number of products, including CDs, DVDs, Blu-rays, QR codes, multiple data transmission standards, broadcast television standards, and RAID 6. Not every parity system in the world relies directly on Reed-Solomon, but the ideas behind it underpin much of modern error correction.

The original algorithms describe a method of encoding data with parity. What Reed and Solomon showed was that it is possible to build an effective decoder for the parity calculations needed to determine whether a data stream has been corrupted. Exactly how much parity is encoded within a transmission depends on a number of factors, and I’m not ashamed to admit that the math begins to escape me past a certain point.

Backblaze’s Java implementation of Reed-Solomon is designed to split data stored in a Vault into 17 data shards and to calculate three parity shards from them. The Storage Pod hardware can process incoming data at roughly 149MB/s when operating on uncached data (the library and data are available from GitHub). Backblaze has published an example of how Reed-Solomon codes are calculated, which we’ve reproduced below:

The examples below use a “4+2” coding system, where the original file is broken into 4 pieces, and then 2 parity pieces are added. In Backblaze Vaults, we use 17+3 (17 data plus three parity). The math, and the code, work for any numbers as long as you have at least one data shard and don’t have more than 256 shards total.
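To make the trade-off between the two layouts concrete, here is a minimal Python sketch of the arithmetic. It is not part of Backblaze's library; the function name `describe_layout` and the returned field names are hypothetical, and the only rules it encodes are the ones stated above (at least one data shard, no more than 256 shards total), plus the standard Reed-Solomon property that a layout survives the loss of as many shards as it has parity shards.

```python
def describe_layout(data_shards: int, parity_shards: int) -> dict:
    """Summarize a data+parity layout (hypothetical helper, for illustration)."""
    total = data_shards + parity_shards
    # Limits quoted in the walkthrough: >= 1 data shard, <= 256 shards total.
    if data_shards < 1 or parity_shards < 0 or total > 256:
        raise ValueError("need at least 1 data shard and at most 256 shards")
    return {
        "total_shards": total,
        # Any `parity_shards` shards can be lost and the data is still recoverable.
        "tolerated_losses": parity_shards,
        # Bytes stored per byte of original data.
        "storage_overhead": total / data_shards,
    }


for layout in [(4, 2), (17, 3)]:
    print(layout, describe_layout(*layout))
```

Under these assumptions, the 4+2 example stores 1.5 bytes per byte of data, while 17+3 stores 20/17 (a little under 1.18) and tolerates one more lost shard.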

To use Reed-Solomon, you put your data into a matrix. For computer files, each element of the matrix is one byte from the file. The bytes are laid out in a grid to form a matrix. If your data file has “ABCDEFGHIJKLMNOP” in it, you can lay it out like this:

    A B C D
    E F G H
    I J K L
    M N O P

In this example, the four pieces of the file are each 4 bytes long. Each piece is one row of the matrix. The first one is “ABCD”. The second one is “EFGH”. And so on.
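The layout described above can be sketched in a few lines of Python. This is an illustration rather than Backblaze's Java code: the helper name `to_matrix` is hypothetical, and zero-padding the last row when the file length is not a multiple of the shard count is an assumption of this sketch.

```python
def to_matrix(data: bytes, data_shards: int) -> list[bytes]:
    """Split `data` into `data_shards` equal-length rows (one row per piece)."""
    # Ceiling division: every row gets the same length, long enough to fit all bytes.
    row_len = -(-len(data) // data_shards)
    # Assumption: pad the tail with zero bytes so the grid is rectangular.
    padded = data.ljust(row_len * data_shards, b"\x00")
    return [padded[i * row_len:(i + 1) * row_len] for i in range(data_shards)]


for row in to_matrix(b"ABCDEFGHIJKLMNOP", 4):
    print(row.decode("ascii"))
```

A real system would also need to record the original length somewhere so that the padding can be stripped after reconstruction.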

The Reed-Solomon algorithm creates a coding matrix that you multiply with your data matrix to create the coded data. The matrix is set up so that the first four rows of the result are the same as the first four rows of the input. That means the data is left intact, and the only real work is computing the parity.

The result is a matrix with two more rows than the original. Those two rows are the parity pieces.
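The encoding step can be sketched in Python as follows. Several things here are assumptions of the sketch and not confirmed details of Backblaze's library: the byte arithmetic is done in the finite field GF(2^8) using the commonly chosen polynomial 0x11D, and the two parity rows come from a Cauchy matrix, which is one standard way to get a coding matrix with an identity block on top. The function names are hypothetical.

```python
# Log/antilog tables for GF(2^8). Polynomial 0x11D is an assumed, common choice.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= 0x11D


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_inv(a: int) -> int:
    """Multiplicative inverse of a nonzero byte in GF(2^8)."""
    return EXP[255 - LOG[a]]


def coding_matrix(k: int, m: int) -> list[list[int]]:
    """Identity (k rows) stacked on m Cauchy parity rows (assumed construction)."""
    identity = [[int(r == c) for c in range(k)] for r in range(k)]
    parity = [[gf_inv((k + r) ^ c) for c in range(k)] for r in range(m)]
    return identity + parity


def mat_mul(a, b):
    """Matrix product over GF(2^8); addition in this field is XOR."""
    cols = list(zip(*b))
    result = []
    for row in a:
        out_row = []
        for col in cols:
            acc = 0
            for u, v in zip(row, col):
                acc ^= gf_mul(u, v)
            out_row.append(acc)
        result.append(out_row)
    return result


data = [list(b"ABCD"), list(b"EFGH"), list(b"IJKL"), list(b"MNOP")]
coded = mat_mul(coding_matrix(4, 2), data)
for row in coded:
    print(bytes(row).hex())
```

The first four rows printed are the original data unchanged, and the last two are the parity pieces, matching the "4+2" description.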

Each row of the coding matrix produces one row of the result. So each row of the coding matrix makes one of the resulting pieces of the file. Because the rows are independent, you can cross out two of the rows and the equation still holds.

And with those rows completely gone it looks like this:

Because of all the work that mathematicians have done over the years, we know that the coding matrix with those rows removed, the matrix on the left, is invertible. There is an inverse matrix that, when multiplied by that coding matrix, produces the identity matrix. As in normal algebra, in matrix algebra you can multiply both sides of an equation by the same thing. In this case, we’ll multiply on the left by the inverse matrix:

The inverse matrix and the coding matrix cancel out.

Which leaves the equation for reconstructing the original data from the pieces that are available:
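The algebra in the last few steps can be written compactly. The symbols are labels chosen for this sketch, not notation from Backblaze: C' is the 4x4 coding matrix with the two missing rows crossed out, D is the 4x4 data matrix, and P' holds the four pieces that are still available.

```latex
\begin{aligned}
C' D &= P' \\
(C')^{-1} C' D &= (C')^{-1} P' \\
D &= (C')^{-1} P'
\end{aligned}
```

The second line multiplies both sides on the left by the inverse, and the third uses the fact that the inverse and C' cancel to the identity.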

So, to make a decoding matrix, the process is to take the original coding matrix, cross out the rows for the missing pieces, and then find the inverse matrix. You can then multiply the inverse matrix and the pieces that are available to reconstruct the original data.
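The whole procedure can be sketched end to end in Python. As before, this is an illustration under stated assumptions, not Backblaze's implementation: the field is GF(2^8) with the assumed polynomial 0x11D, the parity rows come from a Cauchy matrix, the inverse is found by plain Gauss-Jordan elimination, and all function names are hypothetical.

```python
# Log/antilog tables for GF(2^8). Polynomial 0x11D is an assumed, common choice.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= 0x11D


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    return EXP[255 - LOG[a]]


def coding_matrix(k, m):
    """Identity (k rows) stacked on m Cauchy parity rows (assumed construction)."""
    identity = [[int(r == c) for c in range(k)] for r in range(k)]
    parity = [[gf_inv((k + r) ^ c) for c in range(k)] for r in range(m)]
    return identity + parity


def mat_mul(a, b):
    """Matrix product over GF(2^8); addition in this field is XOR."""
    cols = list(zip(*b))
    result = []
    for row in a:
        out_row = []
        for col in cols:
            acc = 0
            for u, v in zip(row, col):
                acc ^= gf_mul(u, v)
            out_row.append(acc)
        result.append(out_row)
    return result


def invert(matrix):
    """Gauss-Jordan inversion over GF(2^8); fails if the matrix is singular."""
    n = len(matrix)
    # Augment with the identity: [M | I] is reduced to [I | M^-1].
    work = [list(row) + [int(r == c) for c in range(n)]
            for r, row in enumerate(matrix)]
    for col in range(n):
        # Swap a row with a nonzero entry in this column into the pivot position.
        pivot = next(r for r in range(col, n) if work[r][col] != 0)
        work[col], work[pivot] = work[pivot], work[col]
        # Scale the pivot row so that the pivot becomes 1.
        scale = gf_inv(work[col][col])
        work[col] = [gf_mul(v, scale) for v in work[col]]
        # Clear this column in every other row (subtraction is XOR here).
        for r in range(n):
            if r != col and work[r][col] != 0:
                factor = work[r][col]
                work[r] = [v ^ gf_mul(factor, p)
                           for v, p in zip(work[r], work[col])]
    return [row[n:] for row in work]


data = [list(b"ABCD"), list(b"EFGH"), list(b"IJKL"), list(b"MNOP")]
coding = coding_matrix(4, 2)
coded = mat_mul(coding, data)

lost = {1, 4}  # pretend the "EFGH" piece and the first parity piece are gone
kept = [r for r in range(len(coded)) if r not in lost]
decoding = invert([coding[r] for r in kept])          # cross out rows, invert
recovered = mat_mul(decoding, [coded[r] for r in kept])
print([bytes(row).decode("ascii") for row in recovered])
```

With this construction, any two of the six pieces can be dropped and `recovered` still equals `data`; dropping three or more leaves fewer than four rows, and the matrix can no longer be inverted.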

Backblaze hasn’t clarified exactly why they’re open-sourcing this section of their code, but presumably it’s to offer useful data and inspiration to individuals looking to roll their own solutions — or just to get eyes and fingers on the codebase in the hopes of improving it further.