From HaskellWiki



Some implementations of the 'wc -l' program in Haskell, with an eye to C-like performance. This illustrates the balance to be made between performance and elegance, over several increasingly fast (and more complex) examples.

Baseline

The baseline is the C program 'wc'

$ du -hsL /usr/share/dict/words 912K /usr/share/dict/words $ time wc -l < /usr/share/dict/words 98326 wc -l < /usr/share/dict/words 0.00s user 0.00s system 27% cpu 0.015 total

So the best we can probably hope to get is around 0.015s

Standard [Char]

main :: IO () main = print . length . lines =<< getContents

$ ghc -O wc.hs $ time ./wc < /usr/share/dict/words 98326 ./wc < /usr/share/dict/words 0.10s user 0.00s system 94% cpu 0.106 total

Ok. About 10x C, as to be expected with a list representation.

Faster [Char]

Perhaps writing our loop, rather than the duplication involved in length . lines, will improve things:

main :: IO () main = interact ( count 0 ) where count i [] = show i count i ( '

' : xs ) = count ( i + 1 ) xs count i ( _ : xs ) = count i xs

$ ghc -O wc.hs $ time ./wc < /usr/share/dict/words 98326./wc < /usr/share/dict/words 0.06s user 0.00s system 87% cpu 0.073 total

Ok. Not too bad.

Data.ByteString

Try to improve performance by using the Data.ByteString library. This uses packed byte arrays instead of heap-allocated [Char] to represent strings.

import qualified Data.ByteString.Char8 as B main :: IO () main = B . getContents >>= print . B . count '

'

$ time ./wc < /usr/share/dict/words 98326 ./wc < /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total

Much better, it is now becoming competitive with C. This (and the Data.ByteString.Lazy example below) is as fast as we'll get.

Data.ByteString.Lazy

Or we could use the new lazy bytestring type, a lazy list of strict, L1-cache-sized chunks of bytes. This example due to Chad Scherrer:

import qualified Data.ByteString.Lazy.Char8 as L main :: IO () main = L . getContents >>= print . L . count '

'

$ time ./a < /usr/share/dict/words 98326 ./a < /usr/share/dict/british-english 0.00s user 0.00s system 25% cpu 0.016 total

Line-by-line processing

We can ask the bytestring library to hand us a string line at a time.

import System.IO import Data.ByteString ( hGetLines ) main = hGetLines stdin >>= print . length

$ time ./b < /usr/share/dict/british-english 98326 ./b < /usr/share/dict/british-english 0.04s user 0.01s system 94% cpu 0.055 total

Though this is a bit slower, since it needs to hang on to the lines for longer.

Ptr hacking

ByteStrings give you access to the underlying pointers to bytes in memory, which can be used to optimise some particular code. So when the ByteString api doesn't provide what you want, you can step inside the ForeignPtr and go nuts.

This example also makes use of a cpp macro to force strictness on a function, via a seq guard case.

import Foreign import Foreign.ForeignPtr import System.Environment import qualified Data.ByteString as B # define STRICT4 ( f ) f a b c d | a ` seq ` b ` seq ` c ` seq ` d ` seq ` False = undefined main = head ` fmap ` getArgs >>= B . readFile >>= \ ( B . PS x _ l ) -> withForeignPtr x $ \ p -> go p l 0 0 where go :: Ptr Word8 -> Int -> Int -> Int -> IO () STRICT4 ( go ) go p l n i | n >= l = print i | otherwise = do ( w :: Word8 ) <- peek ( p ` plusPtr ` n ) go p l ( n + 1 ) $ if w == 0x0a then ( i + 1 ) else i

$ ghc -O -package fps -fglasgow-exts -cpp wc.hs $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.01s system 67% cpu 0.018 total

Ok, slower than using length . lines. Lets try some other things.

Use the FFI

Try and step around the inefficent need to inspect each character in Haskell, by using memchr(3), the C function to find each newline for us.

{-# LANGUAGE BangPatterns #-} import Foreign import Foreign.ForeignPtr import Foreign.C.Types import System.Environment import qualified Data.ByteString as B main = do f <- head ` fmap ` getArgs B . readFile f >>= \ ( B . PS x _ l ) -> withForeignPtr x $ \ p -> go p l 0 0 where go :: Ptr Word8 -> Int -> Int -> Int -> IO () go ! p ! l ! n ! i | n >= l = print i | otherwise = do let p' = p ` plusPtr ` n q = memchr p' 0x0a ( fromIntegral ( l - n )) if q == nullPtr then print i else do let k = q ` minusPtr ` p' go p l ( n + k + 1 ) ( i + 1 ) foreign import ccall unsafe "string.h memchr" memchr :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

$ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.00s system 47% cpu 0.017 total

Slowly inching forwards.

Read the Core

While we're here, we can check whether the strictness on the 'go' function makes any difference, by reading the GHC Core:

$ ghc -O -package fps -cpp -ffi wc.hs -ddump-simpl | less

Search for the 'go' function:

Main.$wgo :: GHC.Prim.Addr# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.IOBase.IO ()

And without the strictness:

Main.$wgo :: GHC.Ptr.Ptr GHC.Word.Word8 -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Base.Int -> GHC.IOBase.IO ()

So GHC is helpfully unboxing the Ptr Word8 into a raw machine Addr#.

Avoid some code

The guard that checks the length is unneeded, since memchr takes a length argument anyway. It also calculates the next pointer for us, so avoid recalculating it. (Note that this is equivalent to using the 'count' function, which has the same implementation).

import Foreign import Foreign.ForeignPtr import Foreign.C.Types import System.Environment import qualified Data.ByteString as B # define STRICT3 ( f ) f a b c | a ` seq ` b ` seq ` c ` seq ` False = undefined main = do f <- head ` fmap ` getArgs B . readFile f >>= \ ( B . PS x s l ) -> withForeignPtr x $ \ p -> go ( p ` plusPtr ` s ) ( fromIntegral l ) 0 where go :: Ptr Word8 -> CSize -> Int -> IO () STRICT3 ( go ) go p l i = do let q = memchr p 0x0a l if q == nullPtr then print i else do let k = fromIntegral $ q ` minusPtr ` p go ( q ` plusPtr ` 1 ) ( l - k ) ( i + 1 ) foreign import ccall unsafe "string.h memchr" memchr :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

Checking the Core, 'go' is now:

Main.$wgo :: GHC.Prim.Addr# -> GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.IOBase.IO ()

The code is certainly a bit simpler, at least.

$ ghc -O -package fps -cpp -ffi wc.hs $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.01s system 70% cpu 0.017 total

But we can't seem to squeeze any more out, at least on data this size.

Going via C

We reach a point where I can't think of any more tricks, so we can always code up a little C and call into that, for this tight loop. Sometimes we just have to do this, and that's what the ffi is for, after all.

-- wc.hs import Foreign import System.Environment import qualified Data.ByteString as B main = do f <- head ` fmap ` getArgs B . readFile f >>= \ ( B . PS x _ l ) -> withForeignPtr x $ \ p -> print ( c_wc p l ) foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int -- wc_c.c int wc ( char * p , int len ) { int c ; for ( c = 0 ; len --; ++p) if ( * p == '

' ) ++ c ; return c ; } -- wc.h int wc ( char * p , int len );

$ gcc -O3 -c wc_c.c $ ghc -O -package fps wc.hs -o wc -fglasgow-exts wc_c.o $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total

And we are done. Note that the tight C loop didn't give us anything in the end over the naive ByteString code, which is a very satisfying result.