Haskell Performance: Array Creation

Ran into another interesting performance problem while converting a small test program over to Haskell. Let’s say that you want to walk through every line of a text file, collate character frequencies, and return anything that maps to a particular frequency. For purposes of explanation we’ll do something really silly like look for lines with 10 capital ‘A’s.

    {-# OPTIONS -XBangPatterns #-}
    import IO
    import System
    import Data.Word
    import Data.Array.Unboxed
    import Control.Monad.ST
    import Data.Array.ST
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI
    import qualified Data.ByteString.Char8 as C

    counts' !line = do
        arr <- newArray (0, 255) 0 :: ST s (STUArray s Word8 Int)
        -- collate character counts here
        return arr

    counts !line = runSTUArray (counts' line)

    hit !line = let freq = counts line
                in freq ! 65 == 10

    main = do
        args <- getArgs
        f <- openFile "wordlist" ReadMode
        text <- B.hGetContents f
        print $ length $ filter hit (C.lines text)

This is fast (0.07s on a 300K test file, compiled with -O2 on a MacBook Pro), but there’s no actual collation going on. It suddenly becomes two orders of magnitude slower as soon as you start to do anything based on the line:

    -- version 1: 9.5 seconds
    counts' !line = do
        arr <- newArray (0, 255) 0 :: ST s (STUArray s Word8 Int)
        readArray arr (B.head line) >>= \v -> writeArray arr (B.head line) (v + 1)
        return arr

    counts !line = runSTUArray (counts' line)

That is, 9.54 seconds just to collate the first character of each line. This led me to believe that ByteStrings were slow, especially since a constant change to the array was fast (0.07 seconds):

    -- version 2: 0.07 seconds
    counts' !line = do
        arr <- newArray (0, 255) 0 :: ST s (STUArray s Word8 Int)
        readArray arr 0 >>= \v -> writeArray arr 0 (v + 1)
        return arr

(Here I’ll elide the hours of me messing around with profiling and whatnot to try and figure out why B.head was so slow, or how to make B.foldl' update a state variable, or sprinkling strictness bangs all over the place, or the other tons of false trails that I went down.)
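For what it’s worth, the pure-fold approach I was fumbling toward can be written without any mutable array at all. Here is a sketch (not the code from the post; `countByte` and `hitFold` are names I made up for illustration):

```haskell
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- Count occurrences of a single byte with a strict left fold;
-- the accumulator is the "state variable" threaded through the fold.
countByte :: Word8 -> B.ByteString -> Int
countByte b = B.foldl' (\acc w -> if w == b then acc + 1 else acc) 0

-- Same predicate as the array version, for the one byte we care about.
hitFold :: B.ByteString -> Bool
hitFold line = countByte 65 line == 10   -- 65 is 'A'
```

For a single target byte this avoids the array entirely, though it doesn’t generalize to collating all 256 frequencies in one pass the way the array version does.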

It turns out that the slow portion in all of this is the newArray call, not the character collation itself. This becomes obvious if the array creation is moved into its own routine so that it shows up separately in the profile:

    -- version 3
    -- compile with: ghc -package bytestring stuarray.hs -prof -auto-all -O2 -o stuarray.out
    -- run with: ./stuarray.out +RTS -p
    initial_array = do
        arr <- newArray (0, 255) 0 :: ST s (STUArray s Word8 Int)
        return arr

    counts' !line = do
        arr <- initial_array
        let ix = B.head line
        readArray arr ix >>= \v -> writeArray arr ix (v + 1)
        return arr

    counts !line = runSTUArray (counts' line)

… and the relevant part of the profile:

    COST CENTRE          MODULE   no.   entries  %time %alloc   %time %alloc
    MAIN                 MAIN       1         0    0.0    0.0   100.0  100.0
     main                Main     240         1    0.7    0.3   100.0  100.0
      hit                Main     242    335075    0.0    0.0    99.3   99.7
       counts            Main     243    335075    0.0    0.2    99.3   99.7
        counts'          Main     244    335075    0.0    0.0    99.3   99.5
         initial_array   Main     245    335075   99.3   99.5    99.3   99.5

Which brings up two interesting points:

The Haskell optimizer was smart enough to figure out that the code in version 2 built a constant array, and so it only called newArray once (confirmed with a profile).

New array creation seems to be so slow that it dominates a benchmark that has file I/O.
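The first point is easier to see if you write out by hand what GHC effectively does to version 2. Its array expression never mentions `line`, so it can be floated out to a top-level constant that is built exactly once and shared by every caller. A sketch of the transformed code (the name `constCounts` is mine):

```haskell
import Data.Word (Word8)
import Data.Array.Unboxed (UArray, (!))
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)

-- The body of version 2, hoisted to a top-level constant:
-- evaluated once on first use, then shared by all subsequent uses.
constCounts :: UArray Word8 Int
constCounts = runSTUArray $ do
    arr <- newArray (0, 255) 0
    readArray arr 0 >>= \v -> writeArray arr 0 (v + 1)
    return arr
```

Once the per-line code depends on `line` (as in version 1), this hoisting is no longer possible, and the newArray cost is paid on every line.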

The OCaml version took about 0.720 seconds (OCaml source below), compared to a Haskell time of about 8.5 seconds. Reducing the array size to 128 in both cases brought that down to 0.112 seconds for OCaml and 4.3 seconds for Haskell.

    Array size    OCaml    Haskell
    128           0.112    4.3
    255           0.184    8.5
    256           0.720    8.5

Note the discontinuity in OCaml between 255 and 256 elements, which I find interesting. The nice people in #haskell suggested that I stop constructing/destructing the array and instead just zero it out between each line, and it turned out that the best way to do that was just to thaw a starter array.

    -- version 4: 0.11 seconds
    initial_array = listArray (0, 255) (repeat 0) :: UArray Word8 Int

    counts' !line = do
        arr <- thaw initial_array
        let ix = B.head line
        readArray arr ix >>= \v -> writeArray arr ix (v + 1)
        return arr

    counts !line = runSTUArray (counts' line)

This is still kind of strange to me (an object is still getting constructed and destructed, since thaw guarantees a copy), but you can’t argue with a performance increase. (Interestingly, using Array.copy or Array.fill in OCaml is slower than just using Array.make.) Final times? 0.11 seconds in Haskell vs. OCaml’s best of 0.720 seconds (UPDATED: 0.12 seconds; see below)… and the Haskell version is just 50% slower than a C implementation with a gratuitous calloc instead of a memset.

UPDATE: A commenter pointed out that zeroing the OCaml array with a for loop is much faster than Array.copy or Array.fill, bringing it to about the same speed as the Haskell version.

Lessons learned?

Performance is a treacherous mistress

The Haskell optimizer is awesome but you have to be wary when trying to narrow down performance problems

#haskell is always full of useful suggestions

Array creation seems to be slow enough that alternatives should be explored.
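One such alternative is the #haskell suggestion taken literally: allocate a single mutable array for the whole run and zero it out between lines, rather than creating (or thawing) a fresh array per line. A sketch, not benchmarked in the post; the names `zeroArray`/`countsInto` and the choice of IOUArray are mine:

```haskell
import Data.Word (Word8)
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import qualified Data.ByteString as B

-- Reset every bucket of the shared frequency array to zero, in place.
zeroArray :: IOUArray Word8 Int -> IO ()
zeroArray arr = mapM_ (\i -> writeArray arr i 0) [0 .. 255]

-- Clear the shared array, then bump one bucket per byte of the line.
countsInto :: IOUArray Word8 Int -> B.ByteString -> IO ()
countsInto arr line = do
    zeroArray arr
    mapM_ (\w -> readArray arr w >>= writeArray arr w . (+ 1)) (B.unpack line)
```

The caller creates the array once with `newArray (0, 255) 0` and passes it to `countsInto` for each line, so no allocation happens in the per-line loop at all. Whether the 256 in-place writes beat thaw’s block copy would need measuring.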

    open Char
    open String

    let all_lines fn filename =
      let chan = open_in filename in
      try
        while true do
          let line = input_line chan in
          fn line
        done
      with End_of_file -> close_in chan

    let initial_array = Array.make 256 0

    let count line =
      let freq = initial_array in
      let ix = int_of_char line.[0] in
      (* zero the shared array with a for loop, per the update above *)
      for i = 0 to 255 do
        freq.(i) <- 0
      done;
      freq.(ix) <- freq.(ix) + 1;
      freq

    let hit line =
      let freq = count line in
      freq.(65) = 10

    let _ =
      let linecount = ref 0 in
      all_lines
        (fun line -> if hit line then linecount := !linecount + 1 else ())
        "wordlist";
      print_int !linecount
