Text Compression with the Burrows-Wheeler Transform

Text compression is an important and interesting computer application. If you master the program given on this page you'll become familiar with several common ideas in compression, and become acquainted with an amazing and elegant algorithm: the Burrows-Wheeler Transform. This is not my algorithm (I wish!), though the source code implementation and some of the details are mine.

You can compile this source and get a genuine compressor and decompressor. It will be slower than gzip (though fast enough for most use) and have a slightly inferior compression ratio, but just getting in the same ``ballpark'' as gzip is already remarkable, given the simple (and unoptimized) nature of this code.

The heart of this compression method is something called the ``Burrows-Wheeler Transform (BWT)'' which is one of those remarkable inventions that makes the study of algorithms a pleasure. The BWT was discovered many years ago but has never been widely publicized. Perhaps a certain major technology company could never quite find a use for it, but didn't want competitors to share the secret. If so I can sympathize with the inventors thus denied their due fame. (Perhaps Burrows or Wheeler would care to comment.)

The main source reference to the Burrows-Wheeler Transform appears to be ``A Block--sorting Lossless Data Compression Algorithm'' (SRC Research Report 124) by M. Burrows and D. J. Wheeler.

The BWT can be defined with an example. Consider the 51-character text fragment: in the jingle jangle morning I'll go following you

Let's build a 51 x 51 array with all possible rotations of that string: in the jingle jangle morning I'll go following you n the jingle jangle morning I'll go following you i the jingle jangle morning I'll go following you in the jingle jangle morning I'll go following you in he jingle jangle morning I'll go following you in t e jingle jangle morning I'll go following you in th jingle jangle morning I'll go following you in the jingle jangle morning I'll go following you in the ingle jangle morning I'll go following you in the j ngle jangle morning I'll go following you in the ji gle jangle morning I'll go following you in the jin le jangle morning I'll go following you in the jing e jangle morning I'll go following you in the jingl jangle morning I'll go following you in the jingle jangle morning I'll go following you in the jingle angle morning I'll go following you in the jingle j ngle morning I'll go following you in the jingle ja gle morning I'll go following you in the jingle jan le morning I'll go following you in the jingle jang e morning I'll go following you in the jingle jangl morning I'll go following you in the jingle jangle morning I'll go following you in the jingle jangle orning I'll go following you in the jingle jangle m rning I'll go following you in the jingle jangle mo ning I'll go following you in the jingle jangle mor ing I'll go following you in the jingle jangle morn ng I'll go following you in the jingle jangle morni g I'll go following you in the jingle jangle mornin I'll go following you in the jingle jangle morning I'll go following you in the jingle jangle morning 'll go following you in the jingle jangle morning I ll go following you in the jingle jangle morning I' l go following you in the jingle jangle morning I'l go following you in the jingle jangle morning I'll go following you in the jingle jangle morning I'll o following you in the jingle jangle morning I'll g following you in the jingle jangle morning I'll go following you in the jingle jangle morning I'll go ollowing you in the jingle jangle morning I'll go f llowing you in the jingle jangle morning I'll go fo lowing you in the jingle jangle morning I'll go fol owing you in the jingle jangle morning I'll go foll wing you in the jingle jangle morning I'll go follo ing you in the jingle jangle morning I'll go follow ng you in the jingle jangle morning I'll go followi g you in the jingle jangle morning I'll go followin you in the jingle jangle morning I'll go following you in the jingle jangle morning I'll go following ou in the jingle jangle morning I'll go following y u in the jingle jangle morning I'll go following yo in the jingle jangle morning I'll go following you

Now let's reorder the rows so they are in sorted order: I'll go following you in the jingle jangle morning following you in the jingle jangle morning I'll go go following you in the jingle jangle morning I'll in the jingle jangle morning I'll go following you jangle morning I'll go following you in the jingle jingle jangle morning I'll go following you in the morning I'll go following you in the jingle jangle the jingle jangle morning I'll go following you in you in the jingle jangle morning I'll go following 'll go following you in the jingle jangle morning I I'll go following you in the jingle jangle morning angle morning I'll go following you in the jingle j e jangle morning I'll go following you in the jingl e jingle jangle morning I'll go following you in th e morning I'll go following you in the jingle jangl following you in the jingle jangle morning I'll go g I'll go following you in the jingle jangle mornin g you in the jingle jangle morning I'll go followin gle jangle morning I'll go following you in the jin gle morning I'll go following you in the jingle jan go following you in the jingle jangle morning I'll he jingle jangle morning I'll go following you in t in the jingle jangle morning I'll go following you ing I'll go following you in the jingle jangle morn ing you in the jingle jangle morning I'll go follow ingle jangle morning I'll go following you in the j jangle morning I'll go following you in the jingle jingle jangle morning I'll go following you in the l go following you in the jingle jangle morning I'l le jangle morning I'll go following you in the jing le morning I'll go following you in the jingle jang ll go following you in the jingle jangle morning I' llowing you in the jingle jangle morning I'll go fo lowing you in the jingle jangle morning I'll go fol morning I'll go following you in the jingle jangle n the jingle jangle morning I'll go following you i ng I'll go following you in the jingle jangle morni ng you in the jingle jangle morning I'll go followi ngle jangle morning I'll go following you in the ji ngle morning I'll go following you in the jingle ja ning I'll go following you in the jingle jangle mor o following you in the jingle jangle morning I'll g ollowing you in the jingle jangle morning I'll go f orning I'll go following you in the jingle jangle m ou in the jingle jangle morning I'll go following y owing you in the jingle jangle morning I'll go foll rning I'll go following you in the jingle jangle mo the jingle jangle morning I'll go following you in u in the jingle jangle morning I'll go following yo wing you in the jingle jangle morning I'll go follo you in the jingle jangle morning I'll go following

Each row of this array has as much information as the entire array and (if accompanied by a small integer telling us which rotation to use) provides the original text string.

Each column contains the same 51 characters as each row, but they seem to be in a scrambled useless order. Let's show the first, middle and last columns of this sorted array, left-to-right instead of top-to-bottom:

First column: 'Iaeeefggggghiiiijjllllllmnnnnnnooooortuwy

Middle column: gjnolgg htlf'nlneowiIi g ll ie j nilgouneyoao rnm

Final column: golueeengI jlhl nnnn t nwj lgg'ol iiiiargfmylo oo

The first column contains the 51 characters but they are ``clumped.'' The way we constructed the array they will always be maximally clumped. Clumped data can be compressed better than unclumped data, but this sequence of the 51 characters won't tell us much: the original string could be any of its permutations.

The middle column doesn't exhibit clumping. It almost seems like a random permutation of the 51 letters (although of course it isn't really random at all).

The final column does exhibit clumping -- notice the four consecutive n's and four consecutive i's -- though not nearly as much as the first column. This clumping wasn't forced by the sorting, but arises from the repetition of digraphs in English. The clumps of n's and i's arise from ng and in digraphs in the words in, jingle, jangle, morning, and following.

The key idea of the Burrows-Wheeler invention is this: knowledge of just this final column allows us to reconstruct the entire 51 x 51 array (and hence the original text string).

Thus the Burrows-Wheeler Transform will replace the string in the jingle jangle morning I'll go following you with (22, golueeengI jlhl nnnn t nwj lgg'ol iiiiargfmylo oo )

(The 22 tells the decompressor to output the 22nd row of the reconstructed sorted array.)

In the source code below, I use a 32768 x 32768 array instead of 51 x 51. This array would be far too big of course, but the preceding discussion was just conceptual and the software need not actually build such an array physically. To prove that this transform is reversible, and thus useful for compression, we must display an inverse coding procedure. You can undo the transform step by step and finish with an impossibly expensive nest of sorting steps; however Burrows and Wheeler also discovered a straightforward, if non-obvious, procedure to invert the transform in almost linear time.

Referring to the example again, the compressor replaced a 51-character string with 52 characters (the index 22 must be transmitted also). There's no compression yet, but the BWT will be just the first of several steps in the compression: once the characters are ``clumped'' we can apply other compression techniques.

Next let's replace each character with an integer index. poppy for example might become (16, 15, 16, 16, 25) since p is the 16'th letter of the alphabet and so on. (We'll suppose that space is the ``0'th letter.'') We apply a simple idea called Move to Front (MTF), where each character is moved, as it is encountered, to the front of the indexing table, and hence given an index of zero (which will then increase slowly as other characters are moved to the front). Now poppy becomes (16, 16, 1, 0, 25). The first p is still 16 since we start with default indexes. The o is now 16 instead of 15 since p was moved in front of it. The second p is 1 (not 0, because o was just moved in front of it), but that moves p to the very front again so the third p is 0.

We still haven't achieved compression, but at least many of these indexes will be very small integers now. In fact, since we applied the BWT earlier, many of the indexes will be zero.

The next compression step is called Zero Running (ZR). Each non-zero number is now preceded with a count of preceding zeros. With this step, our example (16, 16, 1, 0, 25) will become ( 0 , 16, 0 , 16, 0 , 1, 1 , 25). Our five numbers, four of which were non-zero, have been replaced with eight numbers, two for each non-zero. (For clarity I show the zero run-counts in an alternate color.) If this looks like bad compression, remember that fifty numbers, only four of which are non-zero, would also map to an eight-number output.

Finally, we want to output the list of these numbers to the compressed file. Most of the numbers will be very small so we want to use a code which represents them with few bits. A Huffman code would be appropriate, but even simpler and almost as good is to use a standard regular pattern. We'll use Elias's ``gamma code,'' a popular and elegant device which goes by many names. Perhaps the simplest definition of this code is:

Elias's gamma code maps the positive integer K to K's binary representation with just enough leading zeros to put the first one in the exact middle of the code output. 9, for example, is represented 9 = 0001001.

The gamma code handles positive integers, but we need to handle zero as well. Simply add 1 before the encode, and subtract 1 after the decode.

That sums up our compression algorithm, and the source code should be pretty self-explanatory. (That the marvelous Burrows-Wheeler decoding algorithm works may not be obvious at first glance!) Much of the code just deals with the tedium of actually reading or writing bits to/from a disk or pipe. The code is followed by remarks on a few implemenatation details.

For your convenience if you want to download this software, I have arranged the code into a single source file (and without the HTML ``<'' strings that would cause trouble). After you snip the source at the obvious place and name it bwt.c the commands

cc -O -DCOMPRESSOR -o compressor bwt.c cc -O -o decompressor bwt.c compressor < bwt.c > bwt.c.ZZ decompressor < bwt.c.ZZ > bwt.c.reconstruction cmp bwt.c bwt.c.reconstruction

(Use the "View Source" option to extract this source, as your HTML browser will have changed parts of it when rendering.) #include #include #include /* Coder.h */ #define BIPSHO 8 * sizeof(short) #define LOGSIZ 15 #define BSIZE (1 << LOGSIZ) #define BITBUFSIZE 2048 #define UNDOBUF 3 /* max unget in shorts */ unsigned short Bitbuff[BITBUFSIZE + UNDOBUF]; int inshctr = BITBUFSIZE + UNDOBUF - 1, inbitrem = 0; long getbits(); /* BSIZE must be divisible by sizeof(short) ! */ /* * Need to handle byte order or compressed files * won't be portable across hosts. */ int Swabbing = 0; void ungetbits(int cnt) { inbitrem += cnt; while (inbitrem >= BIPSHO) { inbitrem -= BIPSHO; inshctr -= 1; } } void setswab() { union { char c[2]; short s; } u; if (u.s = 1, u.c[0] == 1) Swabbing ^= 1; } void cswab(short *sp, int cnt) { register unsigned x; if (Swabbing) while (cnt--) { x = *sp; *sp++ = x >> 8 & 0xff | (x & 0xff) << 8; } } /* IO subroutines that work with pipes */ int p_write(int fd, char *buf, int cnt) { int acnt, tcnt; for (acnt = 0; acnt < cnt; acnt += tcnt) { tcnt = write(fd, buf + acnt, cnt - acnt); if (tcnt <= 0) { perror("write"); exit(1); } } return acnt; } int p_read(int fd, char *buf, int cnt) { int acnt, tcnt = 1; for (acnt = 0; acnt < cnt && tcnt; acnt += tcnt) { tcnt = read(fd, buf + acnt, cnt - acnt); if (tcnt < 0) { perror("read"); exit(1); } } return acnt; } #ifdef COMPRESSOR u_char Decbuff[BSIZE + BSIZE]; u_char Encbuff[BSIZE]; u_short Sindex, Leng; u_char *Ptab[BSIZE]; int ptcomp(const void *pp1, const void *pp2) { register u_char *p1 = *(u_char **)pp1, *p2 = *(u_char **)pp2; register c, len = Leng; do c = *p1++ - (int)*p2++; while (c == 0 && --len); return c; } void bwt_encode() { int i; memcpy(Decbuff + Leng, Decbuff, Leng); for (i = 0; i < Leng; i++) Ptab[i] = Decbuff + i; qsort(Ptab, Leng, sizeof(u_char *), ptcomp); for (i = 0; i < Leng; i++) { Encbuff[i] = Ptab[i][Leng - 1]; if (Ptab[i] == Decbuff) Sindex = i; } } u_char Mtftab[256]; int mtf_encode(u_char din) { register i, resu; for (resu = 0; Mtftab[resu] != din; resu++) ; for (i = resu; i; i--) Mtftab[i] = Mtftab[i - 1]; Mtftab[0] = din; return resu; } void eg_encode(int datum) { int n, nn; ++datum; for (n = 1, nn = 2; nn <= datum; n += 2, nn += nn) ; putbits((long)datum, n); } int Numzeros = 0; /* * Modified Elias_gamma code for run-lengths is defined by the code-lengths * 1 2 4 4 6 6 6 6 8 8 8 8 8 8 8 ... * Modified Elias_gamma code for positive MTF is defined by the code-lengths * 2 2 4 4 4 4 6 6 6 6 6 6 6 6 8 ... */ void zr_encode(int datum) { if (datum == 0) ++Numzeros; else { if (Numzeros) { putbits(1L, 1); eg_encode(Numzeros - 1); } else putbits(0L, 1); Numzeros = 0; eg_encode(datum - 1 >> 1); putbits(!((long)datum & 1), 1); } } int outshctr, outbitctr; int putbits(unsigned long dval, int cnt) { if (cnt < 0) { /* Termination code */ putbits(0L, BIPSHO - 1); goto forcewrite; } if (cnt > BIPSHO) { putbits(dval >> BIPSHO, cnt - BIPSHO); cnt = BIPSHO; dval &= (1L << BIPSHO) - 1; } dval <<= BIPSHO * 2 - outbitctr - cnt; Bitbuff[outshctr] |= dval >> BIPSHO; outbitctr += cnt; if (outbitctr >= BIPSHO) { outbitctr -= BIPSHO; outshctr += 1; if (outshctr >= BITBUFSIZE) { forcewrite: cswab((short *)Bitbuff, outshctr); p_write(1, (char *)Bitbuff, outshctr * sizeof(short)); outshctr = 0; } Bitbuff[outshctr] = dval; } } int main(int argc, char **argv) { int i, ix; setswab(); for (i = 0; i < 256; i++) Mtftab[i] = i; /* Loop through blocks */ do { Leng = p_read(0, Decbuff, BSIZE); if (Leng == BSIZE) putbits(1L, 1); else putbits((long)Leng, LOGSIZ + 1); if (Leng == 0) break; bwt_encode(); putbits((long)Sindex, LOGSIZ); for (i = 0; i < Leng; i++) zr_encode(mtf_encode(Encbuff[i])); zr_encode(1); } while (Leng == BSIZE); /* special Code to flush output buffer */ putbits(0L, -1); exit(0); } #else u_char Decbuff[BSIZE + BSIZE]; u_char Encbuff[BSIZE]; u_char Sorted[BSIZE]; u_short Sindex, Leng; u_short Dchart[256]; u_short Echart[256]; u_short Link[BSIZE]; int charcomp(const void *p1, const void *p2) { return *(u_char *)p1 - (int)*(u_char *)p2; } /* This is the fabulous Burrows-Wheeler decoder. */ void bwt_decode() { register int i, m; register u_char b; memcpy(Sorted, Encbuff, Leng); qsort(Sorted, Leng, sizeof(u_char), charcomp); for (i = 0; i < 256; i++) Echart[i] = 0; for (m = -1, i = 0; i < Leng; i++) { if ((b = Sorted[i]) != m) Dchart[m = b] = i; } for (i = 0; i < Leng; i++) { b = Encbuff[i]; Link[Dchart[b] + Echart[b]++] = i; } for (m = Sindex, i = 0; i < Leng; i++) { Decbuff[i] = Sorted[m]; m = Link[m]; } } u_char Mtftab[256]; int mtf_decode(int ix) { u_char resu; for (resu = Mtftab[ix]; ix > 0; ix--) Mtftab[ix] = Mtftab[ix-1]; return Mtftab[0] = resu; } int Numzeros = -1; int zr_decode() { int res; if (Numzeros-- > 0) return 0; if (Numzeros != -1) { if (getbits(1)) { Numzeros = eg_decode(); return 0; } else Numzeros = -1; } res = (eg_decode() << 1) + 1; return res + getbits(1); } int eg_decode() { int n; switch (getbits(3)) { case 0: for (n = 3; getbits(1) == 0; n++) ; return (1 << n) + getbits(n) - 1; case 1: return 3 + getbits(2); case 2: return 1; case 3: return 2; case 4: case 5: case 6: case 7: ungetbits(2); return 0; } } long getbits(int cnt) { long result; int x; if (cnt > BIPSHO) { result = getbits(cnt - BIPSHO) << BIPSHO; return result | getbits(BIPSHO); } else if (cnt <= inbitrem) { result = Bitbuff[inshctr] >> inbitrem - cnt; inbitrem -= cnt; } else { result = (long)Bitbuff[inshctr] << cnt - inbitrem; if (++inshctr >= BITBUFSIZE + UNDOBUF) { for (x = 0; x < UNDOBUF; x++) Bitbuff[x] = Bitbuff[BITBUFSIZE + x]; p_read(0, (char *)(Bitbuff + UNDOBUF), (inshctr - UNDOBUF) * sizeof(short)); /* We assume the input file is valid; * Thus no need to check its length! */ cswab((short *)Bitbuff + UNDOBUF, inshctr - UNDOBUF); inshctr = UNDOBUF; } result |= Bitbuff[inshctr] >> BIPSHO - cnt + inbitrem; inbitrem -= cnt - BIPSHO; } return result & (1L << cnt) - 1; } int main(int argc, char **argv) { int i; setswab(); for (i = 0; i < 256; i++) Mtftab[i] = i; /* Loop through blocks */ do { Leng = getbits(1) ? BSIZE : getbits(LOGSIZ); if (Leng == 0) break; Sindex = getbits(LOGSIZ); for (i = 0; i < Leng; i++) Encbuff[i] = mtf_decode(zr_decode()); zr_decode(); bwt_decode(); p_write(1, Decbuff, Leng); } while (Leng == BSIZE); exit(0); } #endif

Miscellaneous remarks.

The ZR encoding only outputs for non-zero numbers, but the last number encoded might be zero (i.e., part of an ``unfinished run''). So we encode an arbitrary non-zero number, as a sort of ``placeholder,'' at the end (and remember to throw it away during decode). We do this not just at EOF, but whenever we need to insert non-ZR data into the code stream. Fortunately that is infrequent in this software so compression isn't degraded by these extra non-zeros. (Other solutions are possible if such degradation became an issue, but are more complicated.)

A slight variation of the Elias gamma (EG) code is used for the positive MTF codes; this yielded a 3% compression improvement. ( x/2 is submitted to the gamma code instead of x and the LSB of x is sent ``as is.'') We also use a variation of the gamma code for the zero-run codes, though this yielded only a further 1% improvement. One bit is spent indicating whether the run-length is zero; if not, the positive run-length is gamma coded normally.

ptcomp() could simply invoke memcmp(), but I found it significantly faster in my Linux environment to do the simple memcmp-like loop in-line. Here memcmp() will usually miscompare on an early character so in-line speed isn't surprising, but in my experience it's usually possible to outperform even large-sized memcpy()'s, just with the well-known device of partially ``unrolling'' the loop. It seems so worthwhile and important to optimize the library memcpy() and memcmp() routines that you may find it odd that vendors don't, but in my experience these routines have been consistently suboptimal. As just one example, I found I could get dramatic improvement on an imaging program for the Sun-3 workstation (68020 chip) by writing my own memcpy(): on investigation I found that the Sun-3's memcpy() routine had been optimized for the obsolete Sun-2 (68010 chip). (A related, and even more astounding, example is integer-multiply on Sun-4 workstations, where the library routine can be sped up by a large factor -- though surely Sun's fixed this by now! ?.)

In my experiments I found some other ways to improve this code, but I did not incorporate them because the improvements were modest and I wanted to keep this example simple. Nevertheless I will mention two of them:

ptcomp() can be partially unrolled for speed: if (c = p1[0] - (int)p2[0]) return c; else if (c = p1[1] - (int)p2[1]) return c; else if (c = p1[2] - (int)p2[2]) return c; . . . (In the qsort() invocation I detect if Leng is below some minimum and if so use an unenhanced ptcomp() .)





can be partially unrolled for speed: (In the invocation I detect if is below some minimum and if so use an unenhanced .) The policy of moving each character encountered to the very front of the MTF list is too aggressive, although repeated characters must always be encoded with the best index (0). Instead an update like newindex = oldindex < 5 ? 0 : oldindex * 0.4 - 1; worked well. Added code is used to detect repeated characters and encode them as zero. The actual index (previndex) of that character is also remembered so the encoded index can be adjusted: if (oldindex == previndex) encodee = 0; else if (oldindex < previndex) encodee = oldindex + 1; else encodee = oldindex; previndex = newindex;

Please send me some e-mail.

Go back to my home page.