[Haskell-cafe] Re: speed: ghc vs gcc

claus.reinke: > Concrete examples always help, thanks. > > In simple enough situations, one can roll one's own loop unrolling;), > somewhat like shown below (worker/wrapper split to bring the function > parameter representing the loop body into scope, then template haskell > to unroll its applications syntactically, then optimization by > transformation > to get rid of the extra code). It is all rather more complicated than one > would like it to be, what with TH scoping restrictions and all, but > perhaps a library of self-unrolling loop combinators along these lines > might help, as a workaround until ghc does its own unrolling. > > Claus > > {-# LANGUAGE TemplateHaskell #-} > module Apply where > import Language.Haskell.TH.Syntax > apply i bound | i<bound = [| \f x -> $(apply (i+1) bound) f (f i x) |] > | otherwise = [| \f x -> x |] > > {-# LANGUAGE CPP #-} > {-# LANGUAGE TemplateHaskell #-} > {-# LANGUAGE BangPatterns #-} > {-# OPTIONS_GHC -DN=8 -ddump-splices #-} > module Main(main) where > import Apply > main = print $ loopW 1 (10^9) body 0 > > {-# INLINE loopW #-} > loopW :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int > loopW i max body acc = loop i acc > where > loop :: Int -> Int -> Int > loop !i !acc | i+N<=max = loop (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc) > {- > loop !i !acc | i+8<=max = loop (i+8) ( body (i+7) > $ body (i+6) > $ body (i+5) > $ body (i+4) > $ body (i+3) > $ body (i+2) > $ body (i+1) > $ body i acc) > -} > loop !i !acc | i<=max = loop (i+1) (body i acc) > | otherwise = acc > > body :: Int -> Int -> Int > body !i !acc = i+acc > Great thinking! This is EXTREMELY COOL! Main.hs:15:42-57: Splicing expression let apply = apply $dOrd = GHC.Base.$f1 $dNum = GHC.Num.$f6 $dLift = Language.Haskell.TH.Syntax.$f18 in apply (0 :: Int) 8 ======> \ f[a1KU] x[a1KV] -> \ f[a1KW] x[a1KX] -> \ f[a1KY] x[a1KZ] -> \ f[a1L0] x[a1L1] -> \ f[a1L2] x[a1L3] -> \ f[a1L4] x[a1L5] -> \ f[a1L6] x[a1L7] -> \ f[a1L8] x[a1L9] -> \ f[a1La] x[a1Lb] -> x[a1Lb] f[a1L8] (f[a1L8] 7 x[a1L9]) f[a1L6] (f[a1L6] 6 x[a1L7]) f[a1L4] (f[a1L4] 5 x[a1L5]) f[a1L2] (f[a1L2] 4 x[a1L3]) f[a1L0] (f[a1L0] 3 x[a1L1]) f[a1KY] (f[a1KY] 2 x[a1KZ]) f[a1KW] (f[a1KW] 1 x[a1KX]) f[a1KU] (f[a1KU] 0 x[a1KV]) In the second argument of `loop', namely `($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)' In the expression: loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) In the definition of `loop': loop !i !acc | i + 8 <= max = loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) So, that's the fastest yet: $ time ./Main 500000000500000000 ./Main 0.61s user 0.00s system 98% cpu 0.623 total And within 2x the best GCC was doing, gcc -O3 -funroll-loops 0.318 If we unroll even further... $ ghc -O2 -fvia-C -optc-O3 -D64 Main.hs $ time ./Main 500000000500000000 ./Main 0.08s user 0.00s system 94% cpu 0.088 total Very very nice, Claus! Now I'm wondering if we can do this via rewrite rules.... -- Don