I really did not intend this blog to become a repository of Haskell code snippets, but I’ve been rather busy as of late, and writing toy code while waiting for a compile to finish has somehow become my primary means of entertainment. Here is the latest.

Arithmetic coding is a remarkably simple and clever thing. The idea is that given some half-open interval [a,b), that is, the interval a <= x < b, we can partition it into half-open subintervals, such that there is one subinterval per character in the message to be encoded, and the lengths correspond to the character frequencies multiplied by b-a. The same procedure is applied, recursively, to each subinterval, resulting in an infinite hierarchy of coverings of the original interval — call it S. Now, if we throw a rock at S, record the point where it hit, and follow the interval hierarchy, we’ll come up with a unique infinite string of characters.

To construct the actual encoding, set S to [0,1), and find out which subinterval S_1 the first character of the message falls into. For the second character, let S_2 be the appropriate subinterval of S_1, for the third character, let S_3 be the appropriate subinterval of S_2, and so on; if we repeat this procedure as many times as there are characters, we’ll arrive at some interval S_n. Numbers that fall in this interval have a useful property: given any such number, call it x, we have x in S_{n-1} (since x is in S_n, and S_n is a subinterval of S_{n-1}), x in S_{n-2} by the same argument, and, by induction, in every subinterval that we picked while encoding the message. Any such x, therefore, uniquely encodes the message: to decode, simply follow the hierarchy.



*Arith> encodeToStream “encodeToStream returns a pair of lists of bytes, represe

nting the numerator and denominator, respectively.”

([174,77,70,217,88,196,42,26,75,253,160,72,114,92,77,135,32,165,50,80,55,77,233,

103,172,90,177,4],[211,29,119,249,50,167,209,90,128,245,114,158,13,236,212,196,1

1,81,64,169,125,254,83,235,75,2,30,13])

*Arith> encode “testing testing testing”

(23,[(‘ ‘,(0%1,2%23)),(‘e’,(2%23,5%23)),(‘g’,(5%23,8%23)),(‘i’,(8%23,11%23)),(‘n

‘,(11%23,14%23)),(‘s’,(14%23,17%23)),(‘t’,(17%23,1%1))],3430733247%4363211066)

*Arith> (decode . encode . decode . encode) “testing testing testing”

“testing testing testing”

*Arith> encode (concat $ replicate 500 “abcd”)

(2000,[(‘a’,(0%1,1%4)),(‘b’,(1%4,1%2)),(‘c’,(1%2,3%4)),(‘d’,(3%4,1%1))],9%85)



The last test shows the output of ‘encode’ : the length of the message is 2000 characters, this is followed by character distributions (in a practical setting, frequencies would be returned instead of explicit intervals), and finally the encoded message. The entire 2000 byte string is encoded in the fraction 9/85.

Toy code follows. As mentioned earlier, ‘encodeToStream’ is a helper function that breaks the fraction into a pair of lists of bytes; the actual encoder and decoder consist of just ‘encode’, ‘decode’ and ‘freqRanges’, weighing in at 23 lines of code including type annotations and line breaks. Gotta love Haskell.

{ -# OPTIONS - fglasgow - exts #- } module Arith where import Ratio import Data . List import Data . Maybe import Data . Char import qualified Data . Map as M type RangeMap k a = [ ( k , ( Ratio a , Ratio a ) ) ] encode :: ( Ord k , Integral a ) => [ k ] -> ( Int , RangeMap k a , Rational ) encode msg = ( length msg , M . assocs freqMap , best $ foldl pair ( 0 , 1 ) rmap ) where freqMap = freqRanges msg rmap = map ( \ x -> fromJust $ M . lookup x freqMap ) msg best ( a , b ) = approxRational ( ( b + a ) / 2 ) ( ( b - a ) / 2 ) pair ( a , b ) ( x , y ) = ( ( b - a ) * x + a , ( b - a ) * y + a ) decode :: ( Ord a , Integral a ) => ( Int , RangeMap k a , Ratio a ) -> [ k ] decode ( n , freqs , code ) = take n $ decode' code where findChar x = find ( \ ( c , ( a , b ) ) -> ( x >= a ) && ( x < b ) ) freqs decode' code = let ( Just ( c , ( x , y ) ) ) = findChar code in c : decode' ( ( code - x ) / ( y - x ) ) freqRanges :: ( Ord k , Integral a ) => [ k ] -> M . Map k ( Ratio a , Ratio a ) freqRanges str = snd $ M . mapAccum ( \ acc x -> ( acc + x , ( acc , acc + x ) ) ) 0 freqs where freqs = M . map ( \ p -> p % total ) occurences occurences = foldl ( \ m c -> M . insertWith ( + ) c 1 m ) M . empty str total = sum ( M . elems occurences ) encodeToStream msg = let ( len , freqs , code ) = encode msg ( num , denom ) = ( numerator code , denominator code ) bytes n = unfoldr ( \ k -> if k == 0 then Nothing else Just ( rem k 256 , quot k 256 ) ) n in ( bytes num , bytes denom )