This is only a partial answer trying to address the second question:

I tried something like this using the GHC.IO.Buffer API:

    module Main where

    import System.IO
    import System.Environment
    import GHC.IO.Buffer
    import Data.ByteString as BS
    import Control.Monad

    -- Copied from cat source code
    bufsize = 1024*128

    go handle bufPtr = do
      read <- hGetBuf handle bufPtr bufsize
      when (read > 0) $ do
        hPutBuf stdout bufPtr read
        go handle bufPtr

    main = do
      file   <- fmap Prelude.head getArgs
      handle <- openFile file ReadMode
      buf    <- newByteBuffer bufsize WriteBuffer
      withBuffer buf $ go handle

and it seems to come closer to the performance of 'cat', but it is still definitely slower:

    time ./Cat huge > /dev/null
    ./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

    time cat huge > /dev/null
    cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

I think that with the buffer API we explicitly avoid allocating a fresh ByteString for every chunk, which is what happens when using hGetSome as in the original code. I am only guessing here, though, and I don't know exactly what happens in either compiled program.
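To illustrate the "one reusable buffer" idea without depending on GHC internals, here is a minimal sketch that uses `allocaBytes` from `Foreign.Marshal.Alloc` instead of `GHC.IO.Buffer`. This is a swapped-in technique, not the code measured in this answer, and the names `copyHandle` and `chunkSize` are hypothetical. I have not benchmarked it, so treat it only as an illustration of reusing a single buffer across reads.

```haskell
import Control.Monad (when)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import System.IO

-- Chunk size for this sketch (hypothetical name); same value as bufsize above.
chunkSize :: Int
chunkSize = 128 * 1024

-- Copy everything from src to dst through a single fixed-size buffer.
-- The buffer is allocated once by allocaBytes and reused for every chunk,
-- so no ByteString is allocated per read.
copyHandle :: Handle -> Handle -> IO ()
copyHandle src dst =
  allocaBytes chunkSize $ \buf -> do
    let loop = do
          -- hGetBuf returns the number of bytes read; 0 means end of file.
          n <- hGetBuf src (buf :: Ptr Word8) chunkSize
          when (n > 0) $ do
            hPutBuf dst buf n
            loop
    loop
```

In a `main` like the one above, this would be called as `copyHandle handle stdout` after `openFile file ReadMode`.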

UPDATE: Adding the original code's performance on my laptop:

    time ./Cat2 huge > /dev/null
    ./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

UPDATE 2: Adding some basic profiling results:

Original Code:

    Cat2 +RTS -p -RTS huge

    total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
    total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

    COST CENTRE MODULE  %time %alloc

    MAIN        MAIN    100.0  100.0

                                                    individual     inherited
    COST CENTRE MODULE                 no. entries  %time %alloc   %time %alloc

    MAIN        MAIN                    46       0  100.0  100.0   100.0  100.0
     CAF        GHC.IO.Handle.FD        86       0    0.0    0.0     0.0    0.0
     CAF        GHC.Conc.Signal         82       0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding         80       0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.FD               79       0    0.0    0.0     0.0    0.0
     CAF        System.Posix.Internals  75       0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding.Iconv   72       0    0.0    0.0     0.0    0.0

Buffer-API Code:

    Cat +RTS -p -RTS huge

    total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
    total alloc =   3,487,712 bytes  (excludes profiling overheads)

    COST CENTRE MODULE  %time %alloc

    MAIN        MAIN    100.0   98.9

                                                    individual     inherited
    COST CENTRE MODULE                 no. entries  %time %alloc   %time %alloc

    MAIN        MAIN                    44       0  100.0   98.9   100.0  100.0
     CAF        GHC.IO.Handle.FD        85       0    0.0    1.0     0.0    1.0
     CAF        GHC.Conc.Signal         82       0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding         80       0    0.0    0.1     0.0    0.1
     CAF        GHC.IO.FD               79       0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding.Iconv   71       0    0.0    0.0     0.0    0.0

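Notice especially the big difference in allocation: 6,954,068,112 bytes for the original code versus 3,487,712 bytes for the buffer-API code, roughly a factor of 2,000.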