Tim Bray has recently been writing about a simple log file processing task, giving his efforts the (decidedly peculiar) name of Wide Finder. The task at hand is to count popular links in an Apache log file.

Here’s my two minutes’ worth of hat in the ring, in Haskell.

> main = do > args <- getArgs > forM_ args $

ame -> do > m <- (foldl' count M.empty . LB.lines) `fmap` LB.readFile name > mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m) > where on f g x y = g x `f` (g y) > count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of > ((_:g:_):_) -> M.insertWith' (+) g 1 m > _ -> m :: M.Map LB.ByteString Int

Here’s some comparable Python:

> pat = re.compile(r'.*"GET /en/([^ ]+\.html)') > for name in sys.argv[1:]: > d = {} > for line in open(name): > m = pat.match(line) > if m: > d[m.group(1)] = d.setdefault(m.group(1), 0) + 1 > for i in sorted(d.items(), key=lambda x:x[1], reverse=True)[:10]: > print i

The Haskell code can chew through 3.2 million records in 10.5 seconds on my laptop, while the Python takes 11.6 seconds.

Both of these programs spend about 90% of their time in regexp matching code, which makes this just a regexp engine and I/O benchmark. Yawn! Can we squeeze a little bit of entertainment out of the problem?

This is a trivially parallelisable problem: there are no data dependencies between different parts of a log file, so we can process them however we please.

Let’s split the input file into chunks of approximately equal size, aligned to line boundaries. This is dead easy to do with almost no I/O: just seek to the nearest chunk boundary, and read a little until we hit a newline.

> chunkedLineBoundaries :: Int -> FilePath -> IO [(Int64, Int64)] > chunkedLineBoundaries numChunks path = do > totalSize <- (fromIntegral . fileSize) `fmap` getFileStatus path > let chunkSize = totalSize `div` fromIntegral numChunks > bracket (openFile path ReadMode) hClose $ \h -> > flip fix 0 $ \findOffsets offset -> do > let newOffset = offset + chunkSize > hSeek h AbsoluteSeek (fromIntegral newOffset) > flip fix newOffset $ \loop off -> do > eof <- hIsEOF h > if eof > then return [(offset, totalSize - offset)] > else do > bytes <- LB.hGet h 4096 > case LB.elemIndex '

' bytes of > Just n -> do > offsets <- findOffsets (off + n + 1) > return ((offset, fst (head offsets) - offset):offsets) > Nothing -> loop (off + LB.length bytes)

The chunkedLineBoundaries function returns a list of (offset, length) pairs. We’ll use this to fire off multiple threads, each of which will consume a single chunk of the file in parallel.

> withChunks :: Int -> (LB.ByteString -> a) -> FilePath -> IO [a] > withChunks numThreads f path = do > offsets <- chunkedLineBoundaries numThreads path > ch <- newChan > forM_ offsets $ \(offset, count) -> forkIO $ > handle (writeChan ch . Left) $ > bracket (openFile path ReadMode) hClose $ \h -> do > hSeek h AbsoluteSeek (fromIntegral offset) > ret <- (f . LB.take count) `fmap` LB.hGetContents h > ret `seq` writeChan ch (Right ret) > forM offsets (const (readChan ch >>= either throwIO return))

With this process-a-file-in-chunks function in hand, we must restructure our original code a little to fit in. Here’s the core scan-and-update-the-map loop, which does no I/O.

> reCountLines :: LB.ByteString -> M.Map LB.ByteString Int > reCountLines = foldl' count M.empty . LB.lines > where count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of > ((_:g:_):_) -> M.insertWith' (+) g 1 m > _ -> m

We’ll give it an alternate name so we can swap in a better implementation later.

> countLines = reCountLines

Because this function does no I/O, we can run it either sequentially or in parallel.

> sequential = fmap countLines . LB.readFile > parallel = fmap (M.unionsWith (+) . map snd) . withChunks 2 countLines

The parallel function takes the maps returned by each thread and reduces them into a single map, giving a result of exactly the same type as the sequential function.

> -- kind = sequential > kind = parallel

By changing the definition of kind above, we can switch between the sequential and parallel versions of our code. Now main becomes just a framework:

> main = do > args <- getArgs > forM_ args $

ame -> kind name >>= \m -> > mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m) > where on f g x y = g x `f` g y

In order to benefit from the potential parallelism, we have to recompile to use GHC’s threaded runtime. This imposes about a 4% penalty in execution time, so the serial version of the code processs our 3.2 million records in 10.9 seconds instead of 10.5.

Switching to the parallel code, it takes 7.7 seconds to process the same data. We get a less than perfect speedup in part because GHC’s garbage collector runs serially; that results in about 0.7 seconds of serial execution. Still, this is almost twice as fast as the Python code.

Next, let’s get rid of the gratuitous regular expressions, since they’re surely doing a lot more work than necessary for such a simple problem. Here’s a short handwritten replacement:

> fastCountLines :: LB.ByteString -> M.Map LB.ByteString Int > fastCountLines = foldl' count M.empty . LB.lines > where count m line = > let quote = LB.drop (fromJust (LB.elemIndex '\"' line)) line > in if LB.pack "\"GET /en/" `LB.isPrefixOf` quote > then let pfx = LB.drop 9 quote > uri = LB.take (fromJust (LB.elemIndex ' ' pfx)) pfx > in if LB.pack ".html" `isSuffixOf` uri > then M.insertWith' (+) uri 1 m > else m > else m

Using fastCountLines as the value of countLines , this brings best-case serial execution time (i.e. without the threaded runtime) down to 5.1 seconds, and parallel execution time drops to 3.5 seconds, or a third the time required by the original serial-with-regexps Haskell code.

I would expect a four-core machine to further improve performance, though with an added drop in speedup due to GHC’s single-threaded garbage collector.