NoSlow: Microbenchmarks for Haskell array libraries

Over the last couple of days, I have implemented a small benchmark suite which tries to measure the performance of various Haskell array libraries, with particular emphasis on finding out how well they are able to fuse things. It is now on Hackage under the very creative and imaginative name NoSlow (Haskell seems to have gained a tradition of naming benchmark suites nosomething). What it does is compile and run a set of micro-benchmarks using these libraries:

standard lists

primitive arrays from the DPH project (package dph-prim-seq)

uvector (a fork of the above)

vector (primitive, storable and boxed arrays)

storablevector

The actual benchmarking is done by Bryan O’Sullivan’s wonderful criterion.

NoSlow is a very young project, so the output isn’t particularly pretty, but it does the job. It is also not very thoroughly tested and only includes a very restricted set of benchmarks, so saying “library x is better than library y because it produces better numbers in NoSlow” would be completely and utterly wrong. However, it is quite useful for identifying cases where taking a closer look at the code generated for a particular loop might be a good idea. It has already helped me spot one tricky performance regression in the standard list library.

Here is an example of the data it produces.

NoSlow results for GHC 6.13.20091124 (only loops specialised for Double). A dash marks a kernel the backend does not support.

kernel | dph-prim-seq | list | storablevector | uvector | vector Primitive | vector Storable | vector boxed
$a+1 | 58.38504 us | 192.9597 us | 58.25747 us | 58.47876 us | 58.52366 us | 58.77233 us | 405.5711 us
$a+^1 | 59.03597 us | 356.6624 us | 210.4899 us | 80.25804 us | 58.53107 us | 59.19546 us | 417.2587 us
$a+x | 55.62417 us | 236.2620 us | 55.63200 us | 55.62767 us | 55.33446 us | 55.96459 us | 477.2455 us
$a+^x | 56.91372 us | 445.1638 us | 237.9741 us | 75.81757 us | 55.79731 us | 56.45521 us | 491.3216 us
$a+x+x+x+x | 96.75086 us | 280.3636 us | 236.0562 us | 96.76510 us | 95.07297 us | 97.15892 us | 519.6680 us
$a+$b | 64.82325 us | 255.3842 us | 91.09551 us | 64.82592 us | 65.21805 us | 65.10213 us | 500.7479 us
$a+$b(zip) | 64.48503 us | 347.4379 us | - | 64.49701 us | - | - | 499.5277 us
x*$a+$b | 75.63250 us | 452.0626 us | 154.3398 us | 75.68617 us | 110.5486 us | 155.2830 us | 649.9281 us
($a+$b)*($c+$d) | 125.0923 us | 1.080122 ms | 288.3781 us | 125.0150 us | 919.2376 us | 1.132326 ms | 2.967417 ms
($a+$b)*($c+$d)(zip) | 155.3547 us | 742.2439 us | - | 155.3554 us | - | - | 2.509351 ms
(x+$a)*(y+$b) | 104.0011 us | 678.5383 us | 203.9229 us | 103.9131 us | 89.32382 us | 89.98352 us | 401.6301 us
(^x+$a)*(^y+$b) | 118.5413 us | 1.061493 ms | 534.0542 us | 150.2891 us | 822.7933 us | 930.1098 us | 2.650773 ms
filter(neq0)(map0) | 61.45542 us | 78.80583 us | 104.6501 us | 61.35867 us | 68.04449 us | 61.67384 us | 124.5576 us
filter(neq0)(^0) | 17.42000 us | 23.41616 us | 173.0071 us | 17.91704 us | 23.33224 us | 24.97328 us | 59.48926 us
filter(eq0)(map0) | 69.61809 us | 174.2209 us | 242.6149 us | 69.63563 us | 69.34671 us | 69.89652 us | 331.1557 us
filter(eq0)(^0) | 45.96399 us | 94.22360 us | 308.5451 us | 45.75672 us | 52.15233 us | 45.52413 us | 201.7559 us
zip_filter | 76.31172 us | 466.6432 us | 334.0208 us | 76.24536 us | 99.80086 us | 100.4251 us | 341.6012 us
filter_zip | 81.01137 us | 468.8768 us | 239.6135 us | 79.83034 us | 80.92583 us | 87.66268 us | 255.4129 us
filter_evens | 59.13895 us | 182.2491 us | - | 58.59793 us | - | - | 337.2523 us
sum($a*$b) | 115.0304 us | 1.130831 ms | 145.4783 us | 115.0499 us | 73.86368 us | 74.39775 us | 99.39568 us
sum($a*$b)(zip) | 114.9057 us | 1.420020 ms | - | 114.8766 us | - | - | 99.43429 us
sum[m..n] | 53.50359 us | 646.0969 us | 235.4800 us | 53.76784 us | 129.2202 us | 133.0979 us | 134.6730 us
sumsq(map) | 67.20919 us | 653.9266 us | 292.8844 us | 67.21005 us | 143.1072 us | 142.2476 us | 143.5639 us
sumsq(zip) | 67.24133 us | 1.928390 ms | 311.0818 us | 160.3321 us | 369.7680 us | 380.4847 us | 914.5373 us
sum_evens | 120.4421 us | 401.7939 us | - | 120.4255 us | - | - | 150.9723 us

Benchmarks

At the moment, NoSlow only benchmarks a bunch of very small and fairly random loop kernels as I was more concerned with getting the infrastructure in place. Here is an example (called sum($a*$b) in the table):

dotp as bs = sum (zipWith (*) as bs)

When it is built, NoSlow compiles several different versions of this loop. For instance, for the storablevector backend it will generate these two functions:

dotp :: (Storable a, Num a) => StorableVector a -> StorableVector a -> a
dotp :: StorableVector Double -> StorableVector Double -> Double

The first one is overloaded on the element type whereas the second one is specialised to Double. NoSlow does this for all its backends and then runs all generated versions of the benchmark with the same test data. With more testing and more benchmarks, this will give us a rough estimate of the relative performance of the different array libraries. It also shows how much faster specialised loops are compared to their polymorphic versions.
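NoSlow generates the specialised copies itself via Template Haskell, but the effect is much like what GHC’s SPECIALISE pragma does for an overloaded function. A minimal sketch of the idea using plain lists (the code below is illustrative, not part of NoSlow):

```haskell
module Main where

-- Overloaded kernel: compiled once, with the Num operations passed
-- in as a dictionary at runtime.
dotp :: Num a => [a] -> [a] -> a
dotp as bs = sum (zipWith (*) as bs)
-- Ask GHC to also emit a copy with the dictionary resolved to Double
-- at compile time, which is what lets the loop be unboxed and fused.
{-# SPECIALISE dotp :: [Double] -> [Double] -> Double #-}

main :: IO ()
main = print (dotp [1,2,3] [4,5,6] :: Double)  -- prints 32.0
```

The pragma leaves the overloaded version in place, so both can be benchmarked side by side, which is exactly the *Double vs Double comparison in the tables.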

Here is the output of a full run. *Double and Double are overloaded and specialised loops, respectively. At the moment, NoSlow only does Double benchmarks but adding more types is easy.

NoSlow results for GHC 6.13.20091124. Each cell shows the overloaded (*Double) and the specialised (Double) loop; a dash marks a kernel the backend does not support.

kernel | dph-prim-seq | list | storablevector | uvector | vector Primitive | vector Storable | vector boxed
$a+1 | 388.9189 us / 58.38504 us | 334.0156 us / 192.9597 us | 310.7921 us / 58.25747 us | 410.9401 us / 58.47876 us | 831.8845 us / 58.52366 us | 467.6662 us / 58.77233 us | 563.7031 us / 405.5711 us
$a+^1 | 412.6220 us / 59.03597 us | 361.8679 us / 356.6624 us | 808.4398 us / 210.4899 us | 554.7731 us / 80.25804 us | 842.5514 us / 58.53107 us | 469.5773 us / 59.19546 us | 570.0256 us / 417.2587 us
$a+x | 398.3495 us / 55.62417 us | 266.9966 us / 236.2620 us | 317.0817 us / 55.63200 us | 416.8037 us / 55.62767 us | 848.4946 us / 55.33446 us | 479.7607 us / 55.96459 us | 576.0136 us / 477.2455 us
$a+^x | 401.3565 us / 56.91372 us | 376.1232 us / 445.1638 us | 821.9274 us / 237.9741 us | 560.3928 us / 75.81757 us | 854.3754 us / 55.79731 us | 483.5405 us / 56.45521 us | 579.5658 us / 491.3216 us
$a+x+x+x+x | 1.099485 ms / 96.75086 us | 989.8561 us / 280.3636 us | 1.367973 ms / 236.0562 us | 1.108161 ms / 96.76510 us | 1.732897 ms / 95.07297 us | 1.402448 ms / 97.15892 us | 1.918533 ms / 519.6680 us
$a+$b | 675.8037 us / 64.82325 us | 266.1987 us / 255.3842 us | 695.4756 us / 91.09551 us | 721.3087 us / 64.82592 us | 1.238228 ms / 65.21805 us | 758.5373 us / 65.10213 us | 579.0221 us / 500.7479 us
$a+$b(zip) | 674.6217 us / 64.48503 us | 486.2640 us / 347.4379 us | - | 718.1258 us / 64.49701 us | - | - | 579.6009 us / 499.5277 us
x*$a+$b | 843.1906 us / 75.63250 us | 576.2166 us / 452.0626 us | 1.086456 ms / 154.3398 us | 889.1795 us / 75.68617 us | 1.494543 ms / 110.5486 us | 877.4670 us / 155.2830 us | 1.226473 ms / 649.9281 us
($a+$b)*($c+$d) | 1.510817 ms / 125.0923 us | 1.003353 ms / 1.080122 ms | 1.958700 ms / 288.3781 us | 1.582407 ms / 125.0150 us | 3.109755 ms / 919.2376 us | 1.915603 ms / 1.132326 ms | 3.199406 ms / 2.967417 ms
($a+$b)*($c+$d)(zip) | 1.585162 ms / 155.3547 us | 1.345901 ms / 742.2439 us | - | 1.845875 ms / 155.3554 us | - | - | 2.749534 ms / 2.509351 ms
(x+$a)*(y+$b) | 1.139386 ms / 104.0011 us | 969.7649 us / 678.5383 us | 1.451681 ms / 203.9229 us | 1.174562 ms / 103.9131 us | 1.950864 ms / 89.32382 us | 1.404879 ms / 89.98352 us | 2.030718 ms / 401.6301 us
(^x+$a)*(^y+$b) | 1.095473 ms / 118.5413 us | 1.154678 ms / 1.061493 ms | 2.512015 ms / 534.0542 us | 1.530284 ms / 150.2891 us | 2.452523 ms / 822.7933 us | 1.825248 ms / 930.1098 us | 3.399711 ms / 2.650773 ms
filter(neq0)(map0) | 357.0597 us / 61.45542 us | 320.2454 us / 78.80583 us | 581.4489 us / 104.6501 us | 364.2109 us / 61.35867 us | 609.2637 us / 68.04449 us | 531.9321 us / 61.67384 us | 370.0963 us / 124.5576 us
filter(neq0)(^0) | 19.80998 us / 17.42000 us | 6.715188 us / 23.41616 us | 404.8197 us / 173.0071 us | 20.19333 us / 17.91704 us | 24.38774 us / 23.33224 us | 24.77845 us / 24.97328 us | 59.51524 us / 59.48926 us
filter(eq0)(map0) | 527.7883 us / 69.61809 us | 499.8775 us / 174.2209 us | 839.2031 us / 242.6149 us | 542.4786 us / 69.63563 us | 1.015737 ms / 69.34671 us | 622.5425 us / 69.89652 us | 619.5380 us / 331.1557 us
filter(eq0)(^0) | 153.7917 us / 45.96399 us | 87.08318 us / 94.22360 us | 648.8350 us / 308.5451 us | 162.6549 us / 45.75672 us | 417.1879 us / 52.15233 us | 126.1947 us / 45.52413 us | 214.1162 us / 201.7559 us
zip_filter | 696.4477 us / 76.31172 us | 631.0826 us / 466.6432 us | 1.158844 ms / 334.0208 us | 735.2156 us / 76.24536 us | 1.042058 ms / 99.80086 us | 635.7235 us / 100.4251 us | 588.1413 us / 341.6012 us
filter_zip | 741.1476 us / 81.01137 us | 467.9363 us / 468.8768 us | 1.073220 ms / 239.6135 us | 789.2609 us / 79.83034 us | 1.249209 ms / 80.92583 us | 835.3062 us / 87.66268 us | 507.6646 us / 255.4129 us
filter_evens | 315.2792 us / 59.13895 us | 185.6894 us / 182.2491 us | - | 342.7927 us / 58.59793 us | - | - | 321.4253 us / 337.2523 us
sum($a*$b) | 889.3029 us / 115.0304 us | 1.113731 ms / 1.130831 ms | 837.7596 us / 145.4783 us | 630.7635 us / 115.0499 us | 1.046405 ms / 73.86368 us | 956.1445 us / 74.39775 us | 310.2107 us / 99.39568 us
sum($a*$b)(zip) | 893.2471 us / 114.9057 us | 1.890718 ms / 1.420020 ms | - | 624.3066 us / 114.8766 us | - | - | 371.0969 us / 99.43429 us
sum[m..n] | 480.6165 us / 53.50359 us | 1.006370 ms / 646.0969 us | 599.5453 us / 235.4800 us | 257.9740 us / 53.76784 us | 520.8189 us / 129.2202 us | 455.0702 us / 133.0979 us | 450.9119 us / 134.6730 us
sumsq(map) | 601.0236 us / 67.20919 us | 1.389623 ms / 653.9266 us | 916.6944 us / 292.8844 us | 361.3013 us / 67.21005 us | 633.0462 us / 143.1072 us | 632.7607 us / 142.2476 us | 626.6808 us / 143.5639 us
sumsq(zip) | 581.3862 us / 67.24133 us | 2.487876 ms / 1.928390 ms | 1.391611 ms / 311.0818 us | 931.6956 us / 160.3321 us | 2.725828 ms / 369.7680 us | 1.892799 ms / 380.4847 us | 1.770848 ms / 914.5373 us
sum_evens | 522.6244 us / 120.4421 us | 377.5469 us / 401.7939 us | - | 283.7662 us / 120.4255 us | - | - | 293.9641 us / 150.9723 us

NoSlow can also be used to see how well different GHC versions optimise array loops. Here is a comparison of the current HEAD to GHC 6.10.4 (2 means 6.13 is 2x slower, 0.5 means it is 2x faster).

NoSlow results for GHC 6.13.20091124 vs GHC 6.10.4. Each cell shows the ratio for the overloaded (*Double) and the specialised (Double) loop; a dash marks a kernel the backend does not support.

kernel | dph-prim-seq | list | storablevector | uvector | vector Primitive | vector Storable | vector boxed
$a+1 | 0.169 / 0.998 | 1.821 / 1.023 | 0.879 / 1.000 | 0.183 / 1.000 | 0.109 / 0.019 | 0.069 / 1.000 | 0.994 / 0.985
$a+^1 | 0.178 / 1.001 | 1.027 / 1.029 | 0.068 / 0.114 | 0.130 / 1.001 | 0.058 / 0.014 | 0.031 / 0.999 | 1.011 / 1.004
$a+x | 0.172 / 1.004 | 1.138 / 0.989 | 0.970 / 1.006 | 0.185 / 1.005 | 0.121 / 0.018 | 0.072 / 1.002 | 1.024 / 0.965
$a+^x | 0.173 / 0.982 | 1.034 / 1.023 | 0.068 / 0.127 | 0.132 / 0.992 | 0.060 / 0.013 | 0.032 / 1.007 | 1.012 / 0.979
$a+x+x+x+x | 0.400 / 1.019 | 1.082 / 1.007 | 1.033 / 0.274 | 0.386 / 1.019 | 0.059 / 0.008 | 0.049 / 0.089 | 1.034 / 0.981
$a+$b | 0.214 / 0.966 | 1.014 / 1.010 | 0.067 / 1.226 | 0.223 / 0.966 | 0.121 / 0.014 | 0.072 / 1.000 | 0.974 / 0.984
$a+$b(zip) | 0.213 / 0.965 | 1.740 / 1.358 | - | 0.223 / 0.963 | - | - | 0.984 / 1.002
x*$a+$b | 0.251 / 0.997 | 1.000 / 0.982 | 0.101 / 1.065 | 0.263 / 0.997 | 0.085 / 0.020 | 0.051 / 1.292 | 1.007 / 1.034
($a+$b)*($c+$d) | 0.287 / 0.199 | 1.168 / 1.483 | 0.064 / 0.271 | 0.293 / 0.199 | 0.101 / 0.061 | 0.060 / 1.058 | 0.992 / 1.299
($a+$b)*($c+$d)(zip) | 0.302 / 1.262 | 1.301 / 0.906 | - | 0.350 / 1.257 | - | - | 0.949 / 1.249
(x+$a)*(y+$b) | 0.298 / 1.119 | 1.019 / 0.929 | 0.138 / 0.372 | 0.270 / 1.119 | 0.080 / 0.008 | 0.059 / 0.136 | 0.991 / 1.001
(^x+$a)*(^y+$b) | 0.323 / 0.166 | 0.911 / 1.057 | 0.074 / 0.114 | 0.207 / 0.199 | 0.061 / 0.039 | 0.043 / 0.799 | 1.123 / 1.200
filter(neq0)(map0) | 0.374 / 0.999 | 1.275 / 0.990 | 0.828 / 0.519 | 0.364 / 1.000 | 0.067 / 0.047 | 0.064 / 0.999 | 0.986 / 0.982
filter(neq0)(^0) | 0.191 / 0.949 | 0.062 / 1.000 | 0.189 / 0.090 | 0.180 / 1.016 | 0.004 / 0.036 | 0.004 / 1.014 | 0.423 / 1.029
filter(eq0)(map0) | 0.220 / 1.000 | 1.163 / 1.049 | 0.334 / 0.805 | 0.231 / 1.005 | 0.090 / 0.019 | 0.057 / 1.003 | 0.981 / 1.014
filter(eq0)(^0) | 0.131 / 1.010 | 0.435 / 1.177 | 0.164 / 0.154 | 0.137 / 1.008 | 0.050 / 0.019 | 0.014 / 1.003 | 0.738 / 1.060
zip_filter | 0.327 / 0.991 | 1.476 / 0.997 | 0.093 / 0.447 | 0.339 / 0.995 | 0.068 / 0.014 | 0.041 / 0.201 | 1.277 / 0.815
filter_zip | 0.283 / 0.990 | 1.024 / 1.475 | 0.077 / 0.785 | 0.297 / 0.982 | 0.083 / 0.019 | 0.054 / 1.000 | 1.071 / 0.987
filter_evens | 0.180 / 0.984 | 0.985 / 0.998 | - | 0.186 / 0.548 | - | - | 0.109 / 0.115
sum($a*$b) | 0.167 / 1.475 | 0.982 / 0.978 | 0.074 / 1.115 | 0.141 / 1.476 | 0.090 / 0.941 | 0.079 / 0.955 | 0.120 / 0.642
sum($a*$b)(zip) | 0.167 / 1.475 | 1.430 / 1.344 | - | 0.139 / 1.474 | - | - | 0.143 / 0.998
sum[m..n] | 0.137 / 0.516 | 1.042 / 0.994 | 0.273 / 1.014 | 0.104 / 0.371 | 0.063 / 0.043 | 0.062 / 0.065 | 0.090 / 0.048
sumsq(map) | 0.167 / 0.566 | 1.012 / 0.985 | 0.364 / 1.012 | 0.140 / 0.404 | 0.079 / 0.047 | 0.082 / 0.069 | 0.120 / 0.051
sumsq(zip) | 0.088 / 0.557 | 1.009 / 0.968 | 0.115 / 1.002 | 0.165 / 1.017 | 0.153 / 0.061 | 0.107 / 0.180 | 0.272 / 0.309
sum_evens | 0.190 / 2.114 | 0.938 / 1.041 | - | 0.115 / 1.379 | - | - | 0.078 / 0.053

Usage

The current interface is rather minimalistic. After building NoSlow, you get two binaries: noslow and noslow-table. The first one runs the actual benchmarks and produces a log file:

noslow -u log

The log can then be fed to noslow-table to generate HTML tables:

noslow-table log > table.html

We can also restrict the output to a particular data type:

noslow-table log --type=Double > table.html

or compare two logs:

noslow-table --diff log1 log2 > table.html

In this case, the table will contain ratios of corresponding values from the two logs, just like in the table above.

Compiling

NoSlow should work out of the box with GHC 6.10. I haven’t tried it with 6.12 because I didn’t have time to install the release candidate and the required packages. Getting it to build with the HEAD (which is what I’m interested in) is a rather annoying process, though, because uvector is missing a DPH patch which would make it work with the new IO system. Fortunately, I could simply transplant some code from the DPH library (the module is dph-base/Data/Array/Parallel/Arr/BUArr.hs) and everything worked.

Also, storablevector depends on QuickCheck 1, which doesn’t seem to compile with the HEAD. I just removed all references to QuickCheck from the code, but it might be easier to simply not build the storablevector backend (there is a flag for that).

How it works

Benchmarking itself is rather straightforward – all the exciting bits are in criterion. The heart of NoSlow is the specialiser – a Template Haskell function that takes a bunch of abstract loop kernels and specialises them for a particular backend and data type. Here is how it works. The kernels themselves are implemented in a TH quote:

kernels = [d|
    ...

    dotp :: (Num a, I.Vector v a) => Ty (v a) -> v a -> v a -> a
    dotp _ as bs = named "sum($a*$b)" $ I.sum (I.zipWith (*) as bs)

    ...
  |]

They refer to functions from module I (short for NoSlow.Backend.Interface) which are just stubs and do not provide any real functionality. The specialiser traverses the ASTs of the kernels (which it has access to since they are quoted) and replaces references to those stubs with calls to functions from the backend that it is specialising for. For instance, it replaces I.sum by NoSlow.Backend.List.sum (which is just a reexport of the Prelude function) when specialising for lists. It also adjusts type signatures and looks for tags like named in the example above, which sets the kernel’s name in the HTML output.
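The shape of that rewrite can be sketched as a small function over Template Haskell’s Exp type. This is purely illustrative – the real specialiser works over whole declaration groups and also rewrites types – and the name retarget and the stub table are made up for the example:

```haskell
import Language.Haskell.TH

-- Illustrative sketch: replace every variable reference whose base
-- name appears in the table with the corresponding backend function.
-- Only a few Exp constructors are handled, enough to show the idea;
-- a real traversal would cover the whole syntax tree.
retarget :: [(String, Name)] -> Exp -> Exp
retarget table = go
  where
    go (VarE n)       = VarE (maybe n id (lookup (nameBase n) table))
    go (AppE f x)     = AppE (go f) (go x)
    go (InfixE l o r) = InfixE (fmap go l) (go o) (fmap go r)
    go (LamE ps body) = LamE ps (go body)
    go e              = e
```

For example, retargeting the stub table [("sum", 'Prelude.sum)] turns the quoted expression sum xs into Prelude.sum xs, which is exactly the I.sum-to-backend substitution described above.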

This scheme has one big advantage: it allows backends to mark some kernels as unsupported. An example is storablevector which does not support arrays of pairs. The corresponding backend defines zip like this:

zip :: Undefined
zip = Undefined

When replacing I.zip with NoSlow.Backend.StorableVector.zip, the specialiser reifies the latter, notices that its type is Undefined and omits the kernel that contains the call to zip altogether.
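The check itself only has to look at the reified type and recognise the marker. A hypothetical sketch of the predicate (NoSlow’s actual code may well differ):

```haskell
import Language.Haskell.TH

-- A backend exports e.g.  zip :: Undefined  to opt out of a kernel.
-- After reifying the backend function, the specialiser only needs to
-- spot the marker type and drop every kernel that mentions the stub.
isUnsupported :: Type -> Bool
isUnsupported (ConT n) = nameBase n == "Undefined"
isUnsupported _        = False
```

Anything with a real type (even a polymorphic or applied one) falls through to the catch-all case and is kept.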