[Haskell-cafe] Friday Afternoon Play: The Burrows Wheeler Transformation

So, it's Friday afternoon (well, here anyway), and presumably everyone is thinking of leaving for the weekend. Is anyone interested in playing a little game? I'm going to make the first move, and then just wait and see if anyone makes the next. I'm certainly not promising to make any further moves myself, but I'm not promising I won't, either. Below you'll see a naïve implementation of the Burrows Wheeler transformation and its inverse. That's my move. (And don't complain too much about style and such -- I'm half asleep.) Further moves are made by transforming the programme so that it becomes more efficient. The first target is not raw speed, but to adjust it until the computational complexity gets to be what it should be. I'm hoping for elegant moves here. After that, folk might start replacing the lists with more efficient datastructures and so on, but still /elegantly/. The object of the game isn't ultimate speed -- I've no particular wish to see a Haskell version that's as fast as a good C version, especially not if it ends up /looking/ like a C version. Really the objective is to think about Haskell programming -- and probably come up with some observations such as "there really ought to be a class of indexable structures, so that we don't have to change (!!) to (!) all over the place". And to have fun, anyhow... * * * Demonstration of the principle of the Burrows Wheeler transform. The implementations are simplistic (and have worse complexity than is possible). > module Burrows_Wheeler_Transform where > import List This is surely the wrong module to have to find (><) from: > import Data.Graph.Inductive.Query.Monad((><)) The forward transformation. Input a list, output the transformed list and possibly an int (Nothing if there were no elements in the input -- perhaps the result type should be Maybe (NonEmptyList t,Int)) > slow_bwt:: Ord t => [t] -> ([t], Maybe Int) This version works by computing a list of all the rotations of the input and sorting it. The transformation is then the last element of each line of the sorted rotations. Tag the data with an index so that it's possible to find where the original form of the unput ends up after sorting. This is essentially the specification of the transform coded up directly. > slow_bwt l > = (map fst tagged_bwt, index_of_original) > where tagged_bwt = map (last><id) > $ sort > $ rotations l`zip`[0..] > index_of_original = findIndex ((==0).snd) tagged_bwt that looks like worst case n²×log n to me (where n is the length of the string): if every element is the same, the n×log n comparisons the sort does will each look at every element, and that's really more than is necessary. The inverse transform. We should have that @ slow_inv_bwt . slow_bwt == id @ Observe that sorting the input gives the first column of the sorted rotations (and the input is the last). So after one sort we have X ... c Y ... b Z ... a . . . . . . . . . Since they are rotations, rotating each row of this gives pairs that belong to the original list: cX ... by ... aZ ... . . . . . . sorting these gives us the first two columns of the sorted rotations, and again we already know the last column, which can be rotated into the first and so on. > slow_inv_bwt ([],_) = [] > slow_inv_bwt (l,Just index_of_original) > = iterate catsort (replicate len []) After (length input) iterations we have the original sorted rotations, > !! len and we have the index of where the untransformed string is, so we can just pick it back out. > !! index_of_original > where catsort s = sort $ map (uncurry (:)) $ l `zip`s > len = length l This version does rather fewer sorts: > mark2_inv_bwt ([],_) = [] > mark2_inv_bwt (l,Just n) > = (transpose $ take (length l) $ iterate (permuteWith perm) l)!!(perm!!n) > where perm = map snd $ sort $ l `zip` [0..] Should I simply have left it out and hoped that it would turn up after a few moves? I'm not going to explain it, anyway. > rotations l = init $ zipWith (++) (tails l) (inits l) > permuteWith ns l = map (l!!) ns -- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk