The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:

void nibble_sort(unsigned long *buf);

I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.
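To make the input conditions above concrete, here is a minimal sketch of how a test buffer along those lines could be set up. It is not the harness used for the contest. The names `NWORDS`, `test_buf`, `rand64` and `fill_random` are hypothetical, a 64-byte cache line is assumed, and the GCC-specific `aligned` attribute is used because portability is not required here.

```c
#include <stdint.h>
#include <stdlib.h>

#define NWORDS 1024  /* buffer size stated in the contest rules */

/* Assumption: cache lines are 64 bytes, so align the buffer to 64. */
static uint64_t test_buf[NWORDS] __attribute__((aligned(64)));

/* Hypothetical helper: build 64 random bits from several rand() calls,
   since rand() only guarantees 15 random bits per call. */
static uint64_t rand64(void)
{
    uint64_t r = 0;
    for (int i = 0; i < 5; ++i)
        r = (r << 15) ^ (uint64_t)(rand() & 0x7fff);
    return r;
}

/* Fill the whole buffer with random 64-bit words. */
static void fill_random(uint64_t *buf)
{
    for (int i = 0; i < NWORDS; ++i)
        buf[i] = rand64();
}
```

Calling `fill_random(test_buf)` before each timed run gives fresh input. Since `srand` is never called, the sequence is the same on every program run, which is convenient when comparing entries but is a choice of this sketch rather than anything stated in the rules.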

Here’s a slow (but hopefully correct) reference implementation that takes about 380 microseconds to do the job:

#include <assert.h>

/* Return nibble i of w, where nibble 0 is the least significant. */
int read_nibble(unsigned long w, int i) {
  assert(i >= 0 && i < 16);
  unsigned long res = w >> (i * 4);
  return res & 0xf;
}

/* Overwrite nibble i of *w with v. */
void write_nibble(unsigned long *w, int i, int v) {
  assert(i >= 0 && i < 16);
  unsigned long mask = 0xf;
  mask <<= (i * 4);
  *w &= ~mask;
  unsigned long prom = v;
  prom <<= (i * 4);
  *w |= prom;
}

/* Selection sort over the 16 nibbles, smallest value at nibble 0. */
unsigned long nibble_sort_word(unsigned long arg) {
  for (int i = 0; i < 16; ++i) {
    int min = i;
    for (int j = i+1; j < 16; ++j) {
      if (read_nibble(arg, j) < read_nibble(arg, min))
        min = j;
    }
    if (min != i) {
      int tmp = read_nibble(arg, i);
      write_nibble(&arg, i, read_nibble(arg, min));
      write_nibble(&arg, min, tmp);
    }
  }
  return arg;
}

void nibble_sort(unsigned long *buf) {
  for (int i=0; i<1024; i++)
    buf[i] = nibble_sort_word(buf[i]);
}
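For comparison, here is a minimal sketch of one well-known alternative to the selection sort above: a counting sort over the 16 possible nibble values. It is not taken from any contest entry and makes no claim about how it would rank. The name `nibble_sort_word_counting` and the use of `uint64_t` in place of `unsigned long` are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper (not from the post): sort the 16 nibbles of one
   word by counting how often each value 0..15 occurs. */
static uint64_t nibble_sort_word_counting(uint64_t w)
{
    unsigned count[16] = {0};

    /* Pass 1: histogram of nibble values. */
    for (int i = 0; i < 16; ++i)
        count[(w >> (i * 4)) & 0xf]++;

    /* Pass 2: emit values from 0 upward, starting at the least
       significant nibble, so smaller values land at the small end. */
    uint64_t out = 0;
    int pos = 0;
    for (uint64_t v = 0; v < 16; ++v)
        for (unsigned k = 0; k < count[v]; ++k)
            out |= v << (4 * pos++);
    return out;
}
```

With the example from the problem statement, `nibble_sort_word_counting(0xbadbeef)` returns `0xfeedbba000000000`. The counts always sum to 16, so the largest shift is 60 bits and the output never overflows the word.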

Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.

Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.

regehr@regehr-M51AC:nibble_sort$ ./go.sh
    ns  ns / 8 bytes  entry name  errors
  1115          1.09  alexander4       0
  1238          1.21  arseny2          0
  1596          1.56  arseny1          0
  1755          1.71  uwe              0
  2630          2.57  pdewacht2        0
  4205          4.11  beekman2         0
  4466          4.36  beekman1         0
  4867          4.75  alexander3       0
  6084          5.94  alexander1       0
  6364          6.21  koorogi2         0
  8472          8.27  falk             0
  8520          8.32  jerome           0
 10331         10.09  vetter4          0
 10374         10.13  jepler           0
 10950         10.69  alexander2       0
 12096         11.81  vetter3         64
 12714         12.42  koorogi1         0
 14185         13.85  tom              0
 15532         15.17  rjk9             0
 16829         16.43  parnell          0
 16890         16.49  pdewacht1        0
 18804         18.36  chucky           0
 20041         19.57  burton2          0
 20649         20.17  nadav            0
 24908         24.32  bosmans1        60
 25104         24.52  bosmans2         0
 25486         24.89  vetter2          0
 28957         28.28  hans             0
 29928         29.23  anon             0
 30228         29.52  jrose            0
 31007         30.28  carlos           0
 32952         32.18  joris            0
 34562         33.75  vetter1          0
 45440         44.38  mats             0
 47511         46.40  frank            0
 50162         48.99  robin            0
 72762         71.06  grayson          0
 74465         72.72  rosen            0
 79752         77.88  payer            0
 92970         90.79  burton1          0
 94343         92.13  mentre           0
 95522         93.28  vlad             0
 97877         95.58  rogers           0
 99077         96.75  mike             0
101913         99.52  bloom            0
103945        101.51  jarkko           0
109191        106.63  karim            0
120928        118.09  justin           0
160599        156.83  mikael           0
200943        196.23  scotty           0
416207        406.45  ref              0

regehr@regehr-M51AC:nibble_sort$ ./go.sh NOSIMD
    ns  ns / 8 bytes  entry name  errors
  8593          8.39  jerome           0
 10321         10.08  jepler           0
 10346         10.10  vetter4          0
 10959         10.70  alexander2       0
 12105         11.82  vetter3         64
 15537         15.17  rjk9             0
 16842         16.45  parnell          0
 18847         18.41  chucky           0
 19995         19.53  pdewacht1        0
 21631         21.12  nadav            0
 25516         24.92  vetter2          0
 29418         28.73  hans             0
 29928         29.23  anon             0
 30529         29.81  jrose            0
 32956         32.18  joris            0
 34286         33.48  falk             0
 41028         40.07  burton2          0
 42308         41.32  vetter1          0
 44415         43.37  bosmans2         0
 45480         44.41  mats             0
 50179         49.00  robin            0
 53189         51.94  frank            0
 71285         69.61  grayson          0
 73883         72.15  rosen            0
 79744         77.88  payer            0
 94780         92.56  burton1          0
 94970         92.74  mentre           0
 95509         93.27  vlad             0
 98658         96.35  rogers           0
 98922         96.60  mike             0
101722         99.34  bloom            0
104880        102.42  jarkko           0
113147        110.50  karim            0
119168        116.38  justin           0
415529        405.79  ref              0

Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.