Fastest non-secure hash function!?



I've made a few variants of the famous Fowler / Noll / Vo hash, a.k.a. FNV.

Mr. Noll's site is the home of FNV: http://www.isthe.com/chongo/src/fnv/fnv-5.0.2.tar.gz

Added on 2016-Feb-02:

All of my code is 100% FREE, it is even freer than "Public Domain", so use it without any fear of license issues, my word. Sasha, use it freely.

The name of the game is 'Fastest hasher for short strings'. For strings longer than ordinary words, i.e. phrases/sentences, here comes FNV1A_Jesteress: simply the fastest hasher[ess] to my knowledge so far; I will gladly update this statement when a faster one is found/e-mailed. FNV1A_YoshimitsuTRIAD proves to be the fastest (with higher latency, though) 32bit linear slasher, achieving 12GB/s on i7-3930K 3.9GHz and 11GB/s on Phenom II Thuban 4.0GHz. Well, the best overall 32bit hash so far (i.e. not judging by LATENCY or BANDWIDTH alone) is FNV1A_Yorikke. Speaking of 64bit, FNV1A_Tesla has a much worse distribution but hashes lightning fast, especially for keys of 32+ bytes; it is most suitable when the collision handling is done by a superfast lookuper as well. Obviously it needs some refinements; here revision 2 is given. "No High-Speed Limit", says Tesla. When the GOLDEN Yorikke was Interleaved&Interlaced a DIAMANT appeared; it's time for a new generation slasher: FNV1A_Yoshimura, right said Tesla.

Below are the results of running 32bit code built with the Intel 12.1 compiler (/Ox used):

Linear speed on Fantasy's Black-and-Red Rig (i7-3930K, 4500MHz, CPU bus: 125MHz, RAM bus: 2400MHz Quad Channel):

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 4584 clocks or 14.297MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 4645 clocks or 14.109MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 5623 clocks or 11.655MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 6212 clocks or 10.550MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 5329 clocks or 12.298MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 37555 clocks or 1.745MB per clock

Fetching/Hashing a 10MB block 8*1024 times ...
BURST_Read_4DWORDS: (10MB block); 81920MB fetched in 4726 clocks or 17.334MB per clock
BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 4850 clocks or 16.891MB per clock
FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 6363 clocks or 12.874MB per clock
FNV1A_Yorikke: (10MB block); 81920MB hashed in 7173 clocks or 11.421MB per clock
FNV1A_Yoshimura: (10MB block); 81920MB hashed in 6121 clocks or 13.383MB per clock
CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 46394 clocks or 1.766MB per clock

Fetching/Hashing a 5MB block 8*1024 times ...
BURST_Read_4DWORDS: (5MB block); 40960MB fetched in 2046 clocks or 20.020MB per clock
BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 2091 clocks or 19.589MB per clock
FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 2877 clocks or 14.237MB per clock
FNV1A_Yorikke: (5MB block); 40960MB hashed in 3333 clocks or 12.289MB per clock
FNV1A_Yoshimura: (5MB block); 40960MB hashed in 2929 clocks or 13.984MB per clock
CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 22909 clocks or 1.788MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 3207 clocks or 20.435MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 3296 clocks or 19.883MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 4554 clocks or 14.391MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 5285 clocks or 12.400MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 4630 clocks or 14.155MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 36538 clocks or 1.794MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 2433 clocks or 26.936MB per clock
BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 2627 clocks or 24.947MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 4388 clocks or 14.935MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 5163 clocks or 12.693MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 4553 clocks or 14.394MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 36238 clocks or 1.808MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 1968 clocks or 33.301MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 2600 clocks or 25.206MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 4393 clocks or 14.918MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 5126 clocks or 12.785MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 4551 clocks or 14.400MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 36227 clocks or 1.809MB per clock
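For readers wondering how the 'MB per clock' figures come about, here is a minimal sketch. It assumes the figure is simply total megabytes divided by elapsed clock() ticks; the helper name mb_per_clock is mine and is not part of the benchmark source. Since the T7500 dumps further down report "One second seems to have 998 clocks", one clock is roughly a millisecond there, so 'MB per clock' reads approximately as GB/s.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not taken from the benchmark package):
   throughput expressed as total megabytes per elapsed clock tick. */
static double mb_per_clock(uint64_t block_mb, uint64_t repetitions, uint64_t clocks)
{
    assert(clocks > 0);                      /* avoid dividing by zero */
    uint64_t total_mb = block_mb * repetitions;
    return (double)total_mb / (double)clocks;
}
```

For example, mb_per_clock(64, 1024, 5623) lands on the FNV1A_YoshimitsuTRIAD figure for the 64MB block.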







The above dump was made with the HASH_linearspeed_Yoshimura.zip package, please share your 'Results.txt'.

Update, 2013-Jul-21: Yesterday, seeing how t0tum's Core i7 4770K (x44 core, x44 uncore) broke 1TB/s in L1 cache READ (with the new AIDA64, which reports bandwidth using multi-threaded tests), I prepared a new FNV1A_Yoshimura variant called FNV1A_farolito. 'FAROLITO' may shine on HASWELL, but it is still untested/untortured.

For reference: www.sanmayce.com/Fastest_Hash/index.html#farolito

Update, 2013-Jun-25: Having failed to outrun 'xxhash256', at least I recalled a superb verse from who-knows-where: "[One's] Everything was/is not enough", which invoked another old one: "The most important thing is defeat to come NOT from within".

Here comes the dispersion 'TRISMUS' 1+ trillion 1Kb text chunks torture package.

After some 384h of hashing Knight-Tour derivatives on T7500 2200MHz:

FNV1A_YoshimitsuTRIADii: KT_DumpCounter = 0,000,134,217,729; 000,000,001 x MAXcollisionsAtSomeSlots = 000,012; HASHfreeSLOTS = 0,050,530,128
CRC32 0x8F6E37A0, iSCSI: KT_DumpCounter = 0,000,134,217,729; 000,000,004 x MAXcollisionsAtSomeSlots = 000,011; HASHfreeSLOTS = 0,049,561,215
...
FNV1A_YoshimitsuTRIADii: KT_DumpCounter = 1,000,056,291,329; 000,000,001 x MAXcollisionsAtSomeSlots = 007,930; HASHfreeSLOTS = 0,000,000,000
CRC32 0x8F6E37A0, iSCSI: KT_DumpCounter = 1,000,056,291,329; 000,000,002 x MAXcollisionsAtSomeSlots = 007,910; HASHfreeSLOTS = 0,000,000,000

So, 'TRISMUS' rev. E says:
For 1,000,056,291,329:134,217,727 = 7,451:1 ratio the DFTID (deviation-from-the-ideal-dispersion) is:
DFTID = (MAX_depthness-(NumberOfKeys+1)/Slots) / ((NumberOfKeys+1)/Slots) * 100%
or
FNV1A_YoshimitsuTRIADii's DFTID = (7,930-7,451)/7,451*100% = 6.4%
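The DFTID formula above can be sketched in a few lines of C. This is only an illustration: the function name dftid_percent is mine, and I am assuming that the final KT_DumpCounter value corresponds to NumberOfKeys+1 and that Slots is 134,217,727, as the ratio in the text suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: deviation-from-the-ideal-dispersion in percent.
   ideal = (NumberOfKeys+1)/Slots is the chain depth a perfectly even
   dispersion would give; max_depth is the deepest chain observed. */
static double dftid_percent(uint64_t max_depth, uint64_t number_of_keys, uint64_t slots)
{
    assert(slots > 0);
    double ideal = (double)(number_of_keys + 1) / (double)slots;
    return ((double)max_depth - ideal) / ideal * 100.0;
}
```

Under those assumptions, dftid_percent(7930, 1000056291328ULL, 134217727ULL) comes out at roughly 6.4, in line with the figure above; feeding in the CRC32 row's 7,910 gives roughly 6.2 by the same formula.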

Thanks to m^2, here are some results obtained on AMCC 440EPX (aka sequoia) with his benchmarking package. On PowerPC 440 'FNV1a-YoshimitsuTRIAD' slashes MUTSI; it is written well, as if I knew what I was doing, he-he.

PPC440EPx Features

PowerPC 440 Processor
The PowerPC 440 processor is designed for high-end applications: RAID controllers, SAN, iSCSI, routers, switches, printers, set-top boxes, etc. It implements the Book E PowerPC embedded architecture and uses the 128-bit version of IBM's on-chip CoreConnect Bus Architecture.

Features include:
* Up to 667MHz operation
* PowerPC Book E architecture
* 32KB I-cache, 32KB D-cache
  - UTLB Word Wide parity on data and tag address parity with exception force
* Three logical regions in D-cache: locked, transient, normal
* D-cache full line flush capability
* 41-bit virtual address, 36-bit (64GB) physical address
* Superscalar, out-of-order execution
* 7-stage pipeline
* 3 execution pipelines
* Dynamic branch prediction
* Memory management unit
  - 64-entry, full associative, unified TLB with optional parity
  - Separate instruction and data micro-TLBs
  - Storage attributes for write-through, cache-inhibited, guarded, and big or little endian
* Debug facilities
  - Multiple instruction and data range breakpoints
  - Data value compare
  - Single step, branch, and trap events
  - Non-invasive real-time trace interface
* 24 DSP instructions
  - Single cycle multiply and multiply-accumulate
  - 32 x 32 integer multiply
  - 16 x 16 -> 32-bit MAC

System:
  cpu       440EPX
  clock     666.666670MHz
  platform  PowerPC 44x Platform
  model     amcc,sequoia
  Memory    255 MB

Software:         switches
  fsbench 0.14.2  s400 -i3 -c -o200
  gcc 4.4.6       -O3 -fno-tree-vectorize

Data: 16 KB of random data

Codec                   Speed (MB/s)  Speed (ticks/B)
FNV1a-YoshimitsuTRIAD        1122.19             0.57
FNV1a-Yoshimura              1078.13             0.59
FNV1a-Yorikke                1072.76             0.59
fletcher2                     763.72             0.83
FNV1a-Jesteress               690.14             0.92
xxhash                        580.14             1.10
xxhash256                     490.96             1.29
murmur3_x86_128               380.91             1.67
CityHash32                    339.51             1.87
SpookyHash                    255.89             2.48
fletcher4                     226.00             2.81
CityHash64                    179.86             3.53
murmur3_x64_128               177.91             3.57
CityHash128                   162.04             3.92
murmur3_x86_32                146.97             4.33

Software:         switches
  fsbench 0.14.2  -i10 -s1000
  gcc 4.4.6       -O3 -fno-tree-vectorize

Data: tar of scc files trimmed to 1 MB each

Codec                   Speed (MB/s)  Speed (ticks/B)
FNV1a-YoshimitsuTRIAD         219.66             2.89
FNV1a-Yorikke                 216.21             2.94
FNV1a-Yoshimura               204.94             3.10
fletcher2                     203.59             3.12
xxhash256                     193.37             3.29
FNV1a-Jesteress               193.37             3.29
xxhash                        184.26             3.45
murmur3_x86_128               164.06             3.88
CityHash32                    157.60             4.03
uhash                         149.14             4.26
CityHash64                    134.46             4.73
SpookyHash                    134.46             4.73
fletcher4                     127.17             5.00
CityHash128                   111.01             5.73
murmur3_x64_128               109.50             5.81
murmur3_x86_32                 96.70             6.58
vhash                          76.79             8.28
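The 'ticks/B' column follows from the 'MB/s' column and the CPU clock. Below is a minimal sketch of that conversion; the helper name ticks_per_byte is mine, and I am assuming 1 MB = 1,048,576 bytes, which is the convention that reproduces the published figures (fsbench itself may compute it differently).

```c
#include <assert.h>

/* Hypothetical helper: CPU ticks spent per hashed byte, given the core
   clock in Hz and the measured speed in MB/s (MB taken as 2^20 bytes). */
static double ticks_per_byte(double clock_hz, double speed_mb_per_s)
{
    assert(speed_mb_per_s > 0.0);
    return clock_hz / (speed_mb_per_s * 1048576.0);
}
```

With the 666.666670MHz clock reported above, ticks_per_byte(666666670.0, 1122.19) rounds to the 0.57 shown for FNV1a-YoshimitsuTRIAD on the 16 KB data set.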

Also, he did an interesting speed test on ARM Cortex-A8:

Cortex-A8 Features

NEON: 128-bit SIMD engine enables high performance media processing. Using NEON for some Audio, Video, and Graphics workloads eases the burden of supporting more dedicated accelerators across the SoC and enables the system to support the standards of tomorrow.

Optimized Level 1 cache: The Level 1 cache is integrated tightly into the processor with a single-cycle access time. The caches combine minimal access latency with hash way determination to maximize performance and minimize power consumption.

Integrated Level 2 cache: The Level 2 cache, integrated into the core, provides ease of integration, power efficiency, and optimal performance. Built using standard compiled RAMs, the cache is configurable from 0K - 1MB. The cache can be built using compiled memories and has programmable delay to accommodate different array characteristics.

Thumb-2 Technology: Delivers the peak performance of traditional ARM code while also providing up to a 30% reduction in memory required to store instructions.

Dynamic Branch Prediction: To minimize branch misprediction penalties, the dynamic branch predictor achieves 95% accuracy across a wide range of industry benchmarks. The Predictor is enabled by branch target and global history buffers. The replay mechanism minimizes the mispredict penalty.

Memory Management Unit: A full MMU enables the Cortex-A8 to run rich operating systems in a variety of Applications.

Jazelle-RCT Technology: RCT Java-acceleration technology to optimize Just In Time (JIT) and Dynamic Adaptive Compilation (DAC), and reduce memory footprint by up to three times.

Memory System: Optimized for power-efficiency and high-performance. Hash array in the L1 cache limits activation of the memories to when they are likely to be needed. Direct interface between the integrated, configurable L2 cache and the NEON media unit for data streaming. Banked L2 cache design that enables only one bank at a time. Support for multiple outstanding transactions to the L3 memory to fully utilize the CPU.

CortexA8, 1 thread @720 MHz, L1 cache

Codec                   Speed (MB/s)  Ticks/B
FNV1a-YoshimitsuTRIAD        1021.79     0.67
FNV1a-Yoshimura               989.00     0.69
FNV1a-Yorikke                 988.45     0.69
fletcher2                     834.93     0.82
FNV1a-Jesteress               682.55     1.01
xxhash                        520.19     1.32
fletcher4                     389.82     1.76
xxhash256                     354.69     1.94
murmur3_x86_128               303.36     2.26
SpookyHash                    229.25     3.00
murmur3_x86_32                136.94     5.01
murmur3_x64_128               133.48     5.14
uhash                         121.58     5.65
vhash                         113.45     6.05
CityHash32                     84.42     8.13
CityHash128                    82.57     8.32
CityHash64                     79.14     8.68









I hope m^2 continues his quest on different platforms; it will be beneficial for many coders.

For reference: www.sanmayce.com/Fastest_Hash/index.html#PENUMBRA

Update, 2013-Jun-16: A very nice, versatile hash benchmark (on 3 platforms: AMD/Intel/Cortex) was done by m^2, where 'xxhash256' (written by Cyan and m^2) dominates MONSTROUSLY in L1/L2 cache linear speed hashing. So my 'secret weapon', i.e. FNV1A_YoshimitsuTRIADiiXMM, got unrolled in order to face the 'monster'; its new name is FNV1A_YoshimitsuTRIADiiXMMx2 aka FNV1A_penumbra. Its C source and its ASM main loop are given below.

The dump below was made with the HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA.zip package.

D:\_KAZE\HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA>dir
 Directory of D:\_KAZE\HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA

06/16/2013  04:21 PM    172,850 HASH_linearspeed_FURY.c
06/16/2013  04:21 PM  1,535,862 HASH_linearspeed_FURY_Intel_64bit_12.cod
06/16/2013  04:21 PM    135,168 HASH_linearspeed_FURY_Intel_64bit_12.exe
06/16/2013  04:21 PM      4,812 HASH_linearspeed_FURY_Intel_64bit_12_XXHASH256_mainloop.cod.txt
06/16/2013  04:21 PM  1,637,166 HASH_linearspeed_FURY_Intel_IA-32_12.cod
06/16/2013  04:21 PM    124,416 HASH_linearspeed_FURY_Intel_IA-32_12.exe
06/16/2013  04:21 PM      6,771 HASH_linearspeed_FURY_Intel_IA-32_12_PENUMBRA_mainloop.cod.txt
06/16/2013  04:21 PM        314 KAZE_compile_Intel32.bat
06/16/2013  04:21 PM        314 KAZE_compile_Intel64.bat
06/16/2013  04:21 PM      6,833 RESULTS_T7500.TXT
06/16/2013  04:21 PM        109 RUNME.bat
06/16/2013  04:21 PM      1,590 Yorikke prompt.lnk

D:\_KAZE\HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA>RUNME.bat

D:\_KAZE\HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA>HASH_linearspeed_FURY_Intel_IA-32_12.exe 1>RESULTS.TXT

D:\_KAZE\HASH_linearspeed_FURY_Intel_32bit_64bit_PENUMBRA>HASH_linearspeed_FURY_Intel_64bit_12.exe 1>>RESULTS.TXT

The 64bit results, HASH_linearspeed_FURY_Intel_64bit_12.exe:

Memory pool starting address: 00000000005A0080 ... 64 byte aligned, OK
Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
XXH_256: (64MB block); 65536MB hashed in 15975 clocks or 4.102MB/4.099MB per clock
FNV1A_penumbra: (64MB block); 65536MB hashed in 14710 clocks or 4.455MB/4.517MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 13900 clocks or 4.715MB/4.763MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14680 clocks or 4.464MB/4.488MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 16130 clocks or 4.063MB/4.055MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14867 clocks or 4.408MB/4.418MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71261 clocks or 0.920MB/0.920MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
XXH_256: (2MB block); 65536MB hashed in 6022 clocks or 10.883MB/10.883MB per clock
FNV1A_penumbra: (2MB block); 65536MB hashed in 6380 clocks or 10.272MB/10.272MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block); 65536MB hashed in 6786 clocks or 9.658MB/ 9.703MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 10374 clocks or 6.317MB/ 6.279MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 10576 clocks or 6.197MB/ 6.215MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 10935 clocks or 5.993MB/ 5.942MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 68125 clocks or 0.962MB/ 0.962MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
XXH_256: (16KB block); 65536MB hashed in 6037 clocks or 10.856MB/10.941MB per clock
FNV1A_penumbra: (16KB block); 65536MB hashed in 5991 clocks or 10.939MB/10.941MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (16KB block); 65536MB hashed in 6021 clocks or 10.885MB/10.883MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 9469 clocks or 6.921MB/ 6.921MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 10250 clocks or 6.394MB/ 6.384MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9937 clocks or 6.595MB/ 6.595MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 67595 clocks or 0.970MB/ 0.970MB per clock

The 32bit results, HASH_linearspeed_FURY_Intel_IA-32_12.exe:

Memory pool starting address: 00AC0040 ... 64 byte aligned, OK
Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
XXH_256: (64MB block); 65536MB hashed in 35787 clocks or 1.831MB/1.831MB per clock
FNV1A_penumbra: (64MB block); 65536MB hashed in 14445 clocks or 4.537MB/4.581MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 14056 clocks or 4.662MB/4.752MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14773 clocks or 4.436MB/4.488MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 16115 clocks or 4.067MB/4.087MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14914 clocks or 4.394MB/4.436MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71573 clocks or 0.916MB/0.916MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
XXH_256: (2MB block); 65536MB hashed in 33212 clocks or 1.973MB/ 1.972MB per clock
FNV1A_penumbra: (2MB block); 65536MB hashed in 6568 clocks or 9.978MB/10.025MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block); 65536MB hashed in 7316 clocks or 8.958MB/ 8.976MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB/ 6.854MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB/ 6.722MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 10311 clocks or 6.356MB/ 6.483MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69763 clocks or 0.939MB/ 0.940MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
XXH_256: (16KB block); 65536MB hashed in 33415 clocks or 1.961MB/ 1.957MB per clock
FNV1A_penumbra: (16KB block); 65536MB hashed in 5819 clocks or 11.262MB/11.293MB per clock !!! Giga Shadow !!!
FNV1A_YoshimitsuTRIADiiXMM: (16KB block); 65536MB hashed in 6973 clocks or 9.399MB/ 9.399MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8908 clocks or 7.357MB/ 7.370MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 8986 clocks or 7.293MB/ 7.294MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9688 clocks or 6.765MB/ 6.754MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69467 clocks or 0.943MB/ 0.949MB per clock

What can I say, despite the hiccup (in the L2 64bit results) 'PENUMBRA' overshadows everything up to 2013-Jun-16, no?

Yet 'xxhash256', being an XMMless function (hashing 128 bytes per loop), proves to be almost as fast as THE monster-slasher 'FNV1A_penumbra', an XMMful function (hashing 192 bytes per loop).

Hearing that the 4th generation i7, i.e. HASWELL, features a much faster L1 (than the 3rd) makes me expect even more favor for XMM; after all, Core 2 is famous for its 32bit performance, but i7 should be much more XMM friendly.

If anyone can send me a 'RESULTS.TXT' obtained on HASWELL, I will gladly post the dump here.

Monster face-offs are interesting and informative however after the havoc the question remains:

"What is the NIFTIEST multi-purpose hasher?"

My answer: 'FNV1A_YoshimitsuTRIADii', just deploy it and forget to worry about the outcome.



// FNV1A_YoshimitsuTRIADiiXMMx2 (revision 2 of FNV1A_YoshimitsuTRIADiiXMM, just unrolled once) aka FNV1A_penumbra, copyleft 2013-Jun-15 Kaze.
// PENUMBRA: Any partial shade or shadow round a thing; a surrounding area of uncertain extent (lit. & fig.). [mod. Latin, from Latin paene almost + umbra shadow.]
//
// Hoy en mi ventana brilla el sol / The sun shines through my window today
// Y el corazón se pone triste contemplando la ciudad / And my heart feels sad while contemplating the city
// Porque te vas / Because you are leaving
// Como cada noche desperté pensando en ti / Just like every night, I woke up thinking of you
// Y en mi reloj todas las horas vi pasar / And I saw as all the hours passed by in my clock
// Porque te vas / Because you are leaving
// Todas las promesas de mi amor se irán contigo / All my love promises will be gone with you
// Me olvidaras, me olvidaras / You will forget me, you will forget me
// Junto a la estación lloraré igual que un niño / Next to the station I will cry like a child
// Porque te vas, porque te vas / Because you are leaving, because you are leaving
// Bajo la penumbra de un farol se dormirán / Under the shadow of a street lamp they will sleep
// Todas las cosas que quedaron por decir se dormirán / All the things left unsaid will sleep there
// Junto a las manillas de un reloj esperarán / They will wait next to a clock's hands
// Todas las horas que quedaron por vivir esperarán / They will wait for all those hours that we had yet to live
// /J[e]anette - 'Porque te vas' lyrics/
//
// Many dependencies, many mini-goals, many restrictions... Blah-blah-blah...
// Yet in my amateurish view the NIFTIEST HT lookups function emerged, it is FNV1A_YoshimitsuTRIADii.
// Main feature: general purpose HT lookups function targeted as 32bit code and 32bit stamp, superfast for 'any length' keys, especially useful for text messages.
//
#include <stdint.h>    // uint32_t, uint16_t
#include <emmintrin.h> //SSE2 (needed because XMM_KAZE_SSE2 is defined below)
//#include <smmintrin.h> //SSE4.1 (needed for _mm_mullo_epi32 when XMM_KAZE_SSE4 is used instead)
//#include <immintrin.h> //AVX
#define xmmload(p) _mm_load_si128((__m128i const*)(p))
#define xmmloadu(p) _mm_loadu_si128((__m128i const*)(p))
// Note: _mm_slli_si128/_mm_srli_si128 shift by whole bytes, so with n=5 this becomes pslldq 5 / psrldq 123 (see the listing at the end).
#define _rotl_KAZE128(x, n) _mm_or_si128(_mm_slli_si128(x, n) , _mm_srli_si128(x, 128-n))
#define _rotl_KAZE32(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define XMM_KAZE_SSE2
// For better mixing the above 'define' should be commented while the next one uncommented!
//#define XMM_KAZE_SSE4
uint32_t FNV1A_penumbra(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 709607;
    uint32_t hash32 = 2166136261;
    uint32_t hash32B = 2166136261;
    uint32_t hash32C = 2166136261;
    const char *p = str;
    uint32_t Loop_Counter;
    uint32_t Second_Line_Offset;

#if defined(XMM_KAZE_SSE2) || defined(XMM_KAZE_SSE4) || defined(XMM_KAZE_AVX)
    __m128i xmm0; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm1; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm2; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm3; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm4; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm5; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm0nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm1nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm2nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm3nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm4nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i xmm5nd; // Defined for clarity: No need of defining it, the compiler sees well and uses no intermediate.
    __m128i hash32xmm = _mm_set1_epi32(2166136261);
    __m128i hash32Bxmm = _mm_set1_epi32(2166136261);
    __m128i hash32Cxmm = _mm_set1_epi32(2166136261);
    __m128i PRIMExmm = _mm_set1_epi32(709607);
#endif

#if defined(XMM_KAZE_SSE2) || defined(XMM_KAZE_SSE4) || defined(XMM_KAZE_AVX)
    if (wrdlen >= 2*4*24) { // Actually 2*4*24 is the minimum and not useful, 200++ makes more sense.
        Loop_Counter = (wrdlen/(2*4*24));
        Loop_Counter++;
        Second_Line_Offset = wrdlen-(Loop_Counter)*(2*4*3*4);
        for(; Loop_Counter; Loop_Counter--, p += 2*4*3*sizeof(uint32_t)) {
            xmm0 = xmmloadu(p+0*16);
            xmm1 = xmmloadu(p+0*16+Second_Line_Offset);
            xmm2 = xmmloadu(p+1*16);
            xmm3 = xmmloadu(p+1*16+Second_Line_Offset);
            xmm4 = xmmloadu(p+2*16);
            xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
            xmm0nd = xmmloadu(p+3*16);
            xmm1nd = xmmloadu(p+3*16+Second_Line_Offset);
            xmm2nd = xmmloadu(p+4*16);
            xmm3nd = xmmloadu(p+4*16+Second_Line_Offset);
            xmm4nd = xmmloadu(p+5*16);
            xmm5nd = xmmloadu(p+5*16+Second_Line_Offset);
#if defined(XMM_KAZE_SSE2)
            hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
            hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0nd,5) , xmm1nd)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3nd,5) , xmm2nd)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4nd,5) , xmm5nd)) , PRIMExmm);
#else
            hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
            hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0nd,5) , xmm1nd)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3nd,5) , xmm2nd)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4nd,5) , xmm5nd)) , PRIMExmm);
#endif
        }
#if defined(XMM_KAZE_SSE2)
        hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , hash32Bxmm) , PRIMExmm);
        hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , hash32Cxmm) , PRIMExmm);
#else
        hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , hash32Bxmm) , PRIMExmm);
        hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , hash32Cxmm) , PRIMExmm);
#endif
        // .m128i_u32 is the Microsoft/Intel compiler way of reaching the four 32bit lanes of an __m128i.
        hash32 = (hash32 ^ hash32xmm.m128i_u32[0]) * PRIME;
        hash32B = (hash32B ^ hash32xmm.m128i_u32[3]) * PRIME;
        hash32 = (hash32 ^ hash32xmm.m128i_u32[1]) * PRIME;
        hash32B = (hash32B ^ hash32xmm.m128i_u32[2]) * PRIME;
    } else if (wrdlen >= 24)
#else
    if (wrdlen >= 24)
#endif
    {
        Loop_Counter = (wrdlen/24);
        Loop_Counter++;
        Second_Line_Offset = wrdlen-(Loop_Counter)*(3*4);
        for(; Loop_Counter; Loop_Counter--, p += 3*sizeof(uint32_t)) {
            hash32 = (hash32 ^ (_rotl_KAZE32(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
            hash32B = (hash32B ^ (_rotl_KAZE32(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
            hash32C = (hash32C ^ (_rotl_KAZE32(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
        }
        hash32 = (hash32 ^ _rotl_KAZE32(hash32C,5) ) * PRIME;
    } else {
        // 1111=15; 10111=23
        if (wrdlen & 4*sizeof(uint32_t)) {
            hash32 = (hash32 ^ (_rotl_KAZE32(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;
            hash32B = (hash32B ^ (_rotl_KAZE32(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+12))) * PRIME;
            p += 8*sizeof(uint16_t);
        }
        // Cases: 0,1,2,3,4,5,6,7,...,15
        if (wrdlen & 2*sizeof(uint32_t)) {
            hash32 = (hash32 ^ *(uint32_t*)(p+0)) * PRIME;
            hash32B = (hash32B ^ *(uint32_t*)(p+4)) * PRIME;
            p += 4*sizeof(uint16_t);
        }
        // Cases: 0,1,2,3,4,5,6,7
        if (wrdlen & sizeof(uint32_t)) {
            hash32 = (hash32 ^ *(uint16_t*)(p+0)) * PRIME;
            hash32B = (hash32B ^ *(uint16_t*)(p+2)) * PRIME;
            p += 2*sizeof(uint16_t);
        }
        if (wrdlen & sizeof(uint16_t)) {
            hash32 = (hash32 ^ *(uint16_t*)p) * PRIME;
            p += sizeof(uint16_t);
        }
        if (wrdlen & 1)
            hash32 = (hash32 ^ *p) * PRIME;
    }
    hash32 = (hash32 ^ _rotl_KAZE32(hash32B,5) ) * PRIME;
    return hash32 ^ (hash32 >> 16);
}

/*
Main loop, 345a-334e+1+7= 276 bytes, using SSE4:

; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FAcs";

;;; for(; Loop_Counter; Loop_Counter--, p += 2*4*3*sizeof(uint32_t)) {
  03346 85 d2  test edx, edx
  03348 0f 84 14 01 00 00  je .B5.6
.B5.4:
;;; xmm0 = xmmloadu(p+0*16);
  0334e f3 0f 6f 09  movdqu xmm1, XMMWORD PTR [ecx]
;;; xmm1 = xmmloadu(p+0*16+Second_Line_Offset);
;;; xmm2 = xmmloadu(p+1*16);
;;; xmm3 = xmmloadu(p+1*16+Second_Line_Offset);
  03352 f3 0f 6f 7c 19 10  movdqu xmm7, XMMWORD PTR [16+ecx+ebx]
;;; xmm4 = xmmloadu(p+2*16);
  03358 f3 0f 6f 69 20  movdqu xmm5, XMMWORD PTR [32+ecx]
;;; xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
;;; xmm0nd = xmmloadu(p+3*16);
  0335d f3 0f 6f 61 30  movdqu xmm4, XMMWORD PTR [48+ecx]
;;; xmm1nd = xmmloadu(p+3*16+Second_Line_Offset);
;;; xmm2nd = xmmloadu(p+4*16);
;;; xmm3nd = xmmloadu(p+4*16+Second_Line_Offset);
;;; xmm4nd = xmmloadu(p+5*16);
;;; xmm5nd = xmmloadu(p+5*16+Second_Line_Offset);
;;; #if defined(XMM_KAZE_SSE2)
;;; hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
;;; hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
;;; hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
;;; hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0nd,5) , xmm1nd)) , PRIMExmm);
;;; hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3nd,5) , xmm2nd)) , PRIMExmm);
;;; hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4nd,5) , xmm5nd)) , PRIMExmm);
;;; #else
;;; hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
  03362 66 0f 6f d1  movdqa xmm2, xmm1
  03366 66 0f 73 fa 05  pslldq xmm2, 5
  0336b 66 0f 73 d9 7b  psrldq xmm1, 123
  03370 66 0f eb d1  por xmm2, xmm1
  03374 f3 0f 6f 0c 19  movdqu xmm1, XMMWORD PTR [ecx+ebx]
  03379 66 0f ef d1  pxor xmm2, xmm1
;;; hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
  0337d 66 0f 6f cf  movdqa xmm1, xmm7
  03381 66 0f 73 f9 05  pslldq xmm1, 5
  03386 66 0f ef f2  pxor xmm6, xmm2
  0338a 66 0f 73 df 7b  psrldq xmm7, 123
  0338f 66 0f eb cf  por xmm1, xmm7
  03393 f3 0f 6f 79 10  movdqu xmm7, XMMWORD PTR [16+ecx]
  03398 66 0f ef cf  pxor xmm1, xmm7
;;; hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
  0339c 66 0f 6f fd  movdqa xmm7, xmm5
  033a0 66 0f 73 ff 05  pslldq xmm7, 5
  033a5 66 0f ef d9  pxor xmm3, xmm1
  033a9 66 0f 73 dd 7b  psrldq xmm5, 123
  033ae 66 0f eb fd  por xmm7, xmm5
  033b2 f3 0f 6f 6c 19 20  movdqu xmm5, XMMWORD PTR [32+ecx+ebx]
;;; hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0nd,5) , xmm1nd)) , PRIMExmm);
;;; hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3nd,5) , xmm2nd)) , PRIMExmm);
  033b8 f3 0f 6f 4c 19 40  movdqu xmm1, XMMWORD PTR [64+ecx+ebx]
  033be 66 0f ef fd  pxor xmm7, xmm5
  033c2 66 0f 6f ec  movdqa xmm5, xmm4
  033c6 66 0f 73 fd 05  pslldq xmm5, 5
  033cb 66 0f ef c7  pxor xmm0, xmm7
  033cf 66 0f 73 dc 7b  psrldq xmm4, 123
  033d4 66 0f eb ec  por xmm5, xmm4
  033d8 f3 0f 6f 64 19 30  movdqu xmm4, XMMWORD PTR [48+ecx+ebx]
;;; hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4nd,5) , xmm5nd)) , PRIMExmm);
  033de f3 0f 6f 79 50  movdqu xmm7, XMMWORD PTR [80+ecx]
  033e3 66 0f 6f 15 00 00 00 00  movdqa xmm2, XMMWORD PTR [_2il0floatpacket.65]
  033eb 66 0f ef ec  pxor xmm5, xmm4
  033ef 66 0f 38 40 f2  pmulld xmm6, xmm2
  033f4 66 0f ef f5  pxor xmm6, xmm5
  033f8 66 0f 6f e9  movdqa xmm5, xmm1
  033fc 66 0f 73 fd 05  pslldq xmm5, 5
  03401 66 0f 73 d9 7b  psrldq xmm1, 123
  03406 66 0f eb e9  por xmm5, xmm1
  0340a 66 0f 6f cf  movdqa xmm1, xmm7
  0340e 66 0f 73 f9 05  pslldq xmm1, 5
  03413 66 0f 73 df 7b  psrldq xmm7, 123
  03418 f3 0f 6f 61 40  movdqu xmm4, XMMWORD PTR [64+ecx]
  0341d 66 0f eb cf  por xmm1, xmm7
  03421 66 0f ef ec  pxor xmm5, xmm4
  03425 f3 0f 6f 7c 19 50  movdqu xmm7, XMMWORD PTR [80+ecx+ebx]
  0342b 66 0f 38 40 da  pmulld xmm3, xmm2
  03430 66 0f ef cf  pxor xmm1, xmm7
  03434 66 0f 38 40 c2  pmulld xmm0, xmm2
  03439 66 0f ef dd  pxor xmm3, xmm5
  0343d 66 0f ef c1  pxor xmm0, xmm1
  03441 83 c1 60  add ecx, 96
  03444 66 0f 38 40 f2  pmulld xmm6, xmm2
  03449 4a  dec edx
  0344a 66 0f 38 40 da  pmulld xmm3, xmm2
  0344f 66 0f 38 40 c2  pmulld xmm0, xmm2
  03454 0f 85 f4 fe ff ff  jne .B5.4
.B5.5:
  0345a 66 0f 6f 0d 00 00 00 00  movdqa xmm1, XMMWORD PTR [_2il0floatpacket.65]
.B5.6:
*/

Click on the legendary Shintaro Katsu to download 'YoYo_r2.zip', which measures hashing bandwidth and counts collisions; the dump below comes from this test package.



Blah-blah... the latest 'put-and-forget' hasher is 'FNV1A_YoshimitsuTRIADii'... yet for the sake of fun, speed training and superfast file hashing, XMM/YMM variants should be implemented. Let's see what boost XMM can deliver; the following dump was obtained on my laptop with an Intel T7500:

Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15132 clocks or 4.331MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 13946 clocks or 4.699MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 13572 clocks or 4.829MB per clock !!! FLASHY-SLASHY: OUTSPEEDS THE INTERLEAVED 8x4 READ !!!
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14399 clocks or 4.551MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15912 clocks or 4.119MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16427 clocks or 3.990MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14555 clocks or 4.503MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71588 clocks or 0.915MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9532 clocks or 6.875MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 9844 clocks or 6.657MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block); 65536MB hashed in 7332 clocks or 8.938MB per clock !!! COMMENTLESS !!!
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 10155 clocks or 6.454MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9766 clocks or 6.711MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10171 clocks or 6.443MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 10717 clocks or 6.115MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69764 clocks or 0.939MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7863 clocks or 8.335MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7894 clocks or 8.302MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (16KB block); 65536MB hashed in 6973 clocks or 9.399MB per clock !!! WIGGING-OUT: 894% faster than CRC32_SlicingBy8 !!!
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8892 clocks or 7.370MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 9110 clocks or 7.194MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9657 clocks or 6.786MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9734 clocks or 6.733MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69342 clocks or 0.945MB per clock

Delerium - Hammer (Feat. Leona Naess)
...
Sad eyes you got songs
And they hit me like a hammer hammer
The whole time I thought you
Were just a wilted little flower flower
...
Sad eyes you got soul
And it hits me like a stammer stammer
The whole time I thought I
Was the one with the power power

// Notes, 2013-Apr-26:
// Wanted to see what a SIMDed main loop would look like:
// One of the main goals: to stress 128bit registers only and nothing else, for now 6 in total; in fact the Intel compiler uses all 8.
// Current approach: instead of rotating the 5 bits within the DWORD quadruplets I chose to do it within the entire DQWORD i.e. XMMWORD.
// Length of the main loop: 02795 - 0270d + 6 = 142 bytes
// CRASH CARAMBA: My CPU T7500 supports up to SSSE3 but not SSE4.1 and AVX, I need a YMM (it reads 'yummy') machine.

// FNV1A_YoshimitsuTRIADiiXMM revision 1+ aka FNV1A_SaberFatigue, copyleft 2013-Apr-26 Kaze.
// Targeted purpose: x-gram table lookups for Leprechaun r17.
// Targeted machine: assuming SSE2 is present always - no non-SSE2 counterpart.
#include <stdint.h>      // uint16_t, uint32_t
#include <emmintrin.h>   // SSE2
//#include <smmintrin.h> // SSE4.1
//#include <immintrin.h> // AVX
#define xmmload(p) _mm_load_si128((__m128i const*)(p))
#define xmmloadu(p) _mm_loadu_si128((__m128i const*)(p))
// Note: _mm_slli_si128/_mm_srli_si128 (pslldq/psrldq) shift by whole BYTES, and a count above 15 yields zero,
// so as written this is a 5-byte left shift of the XMMWORD (see 'pslldq 5' / 'psrldq 123' in the listing below).
#define _rotl_KAZE128(x, n) _mm_or_si128(_mm_slli_si128(x, n) , _mm_srli_si128(x, 128-n))
#define _rotl_KAZE32(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define XMM_KAZE_SSE2
uint32_t FNV1A_Hash_YoshimitsuTRIADiiXMM(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 709607;
    uint32_t hash32 = 2166136261;
    uint32_t hash32B = 2166136261;
    uint32_t hash32C = 2166136261;
    const char *p = str;
    uint32_t Loop_Counter;
    uint32_t Second_Line_Offset;

#if defined(XMM_KAZE_SSE2) || defined(XMM_KAZE_SSE4) || defined(XMM_KAZE_AVX)
    __m128i xmm0;
    __m128i xmm1;
    __m128i xmm2;
    __m128i xmm3;
    __m128i xmm4;
    __m128i xmm5;
    __m128i hash32xmm = _mm_set1_epi32(2166136261);
    __m128i hash32Bxmm = _mm_set1_epi32(2166136261);
    __m128i hash32Cxmm = _mm_set1_epi32(2166136261);
    __m128i PRIMExmm = _mm_set1_epi32(709607);
#endif

#if defined(XMM_KAZE_SSE2) || defined(XMM_KAZE_SSE4) || defined(XMM_KAZE_AVX)
    if (wrdlen >= 4*24) { // Actually 4*24 is the minimum and not useful, 200++ makes more sense.
        Loop_Counter = (wrdlen/(4*24));
        Loop_Counter++;
        // The second line starts so that its last load ends exactly at the end of the key.
        Second_Line_Offset = wrdlen-(Loop_Counter)*(4*3*4);
        for(; Loop_Counter; Loop_Counter--, p += 4*3*sizeof(uint32_t)) {
            xmm0 = xmmloadu(p+0*16);
            xmm1 = xmmloadu(p+0*16+Second_Line_Offset);
            xmm2 = xmmloadu(p+1*16);
            xmm3 = xmmloadu(p+1*16+Second_Line_Offset);
            xmm4 = xmmloadu(p+2*16);
            xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
#if defined(XMM_KAZE_SSE2)
            hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
#else
            hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
            hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
            hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
#endif
        }
#if defined(XMM_KAZE_SSE2)
        hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , hash32Bxmm) , PRIMExmm);
        hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , hash32Cxmm) , PRIMExmm);
#else
        hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , hash32Bxmm) , PRIMExmm);
        hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , hash32Cxmm) , PRIMExmm);
#endif
        // Fold the four DWORD lanes into the two scalar accumulators.
        // (.m128i_u32 is the union member exposed by the Microsoft/Intel compilers.)
        hash32 = (hash32 ^ hash32xmm.m128i_u32[0]) * PRIME;
        hash32B = (hash32B ^ hash32xmm.m128i_u32[3]) * PRIME;
        hash32 = (hash32 ^ hash32xmm.m128i_u32[1]) * PRIME;
        hash32B = (hash32B ^ hash32xmm.m128i_u32[2]) * PRIME;
    } else if (wrdlen >= 24)
#else
    if (wrdlen >= 24)
#endif
    {
        Loop_Counter = (wrdlen/24);
        Loop_Counter++;
        Second_Line_Offset = wrdlen-(Loop_Counter)*(3*4);
        for(; Loop_Counter; Loop_Counter--, p += 3*sizeof(uint32_t)) {
            hash32 = (hash32 ^ (_rotl_KAZE32(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
            hash32B = (hash32B ^ (_rotl_KAZE32(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
            hash32C = (hash32C ^ (_rotl_KAZE32(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
        }
        hash32 = (hash32 ^ _rotl_KAZE32(hash32C,5) ) * PRIME;
    } else {
        // 1111=15; 10111=23
        if (wrdlen & 4*sizeof(uint32_t)) {
            hash32 = (hash32 ^ (_rotl_KAZE32(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;
            hash32B = (hash32B ^ (_rotl_KAZE32(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+12))) * PRIME;
            p += 8*sizeof(uint16_t);
        }
        // Cases: 0,1,2,3,4,5,6,7,...,15
        if (wrdlen & 2*sizeof(uint32_t)) {
            hash32 = (hash32 ^ *(uint32_t*)(p+0)) * PRIME;
            hash32B = (hash32B ^ *(uint32_t*)(p+4)) * PRIME;
            p += 4*sizeof(uint16_t);
        }
        // Cases: 0,1,2,3,4,5,6,7
        if (wrdlen & sizeof(uint32_t)) {
            hash32 = (hash32 ^ *(uint16_t*)(p+0)) * PRIME;
            hash32B = (hash32B ^ *(uint16_t*)(p+2)) * PRIME;
            p += 2*sizeof(uint16_t);
        }
        if (wrdlen & sizeof(uint16_t)) {
            hash32 = (hash32 ^ *(uint16_t*)p) * PRIME;
            p += sizeof(uint16_t);
        }
        if (wrdlen & 1)
            hash32 = (hash32 ^ *p) * PRIME;
    }
    hash32 = (hash32 ^ _rotl_KAZE32(hash32B,5) ) * PRIME;
    return hash32 ^ (hash32 >> 16);
}

/*
!!! The main (SSE2 i.e. 'pmullw' used) loop: !!!
.B4.4:
0270d 8d 3c 76 lea edi, DWORD PTR [esi+esi*2]
02710 46 inc esi
02711 c1 e7 04 shl edi, 4
02714 3b f2 cmp esi, edx
02716 f3 0f 6f 3c 39 movdqu xmm7, XMMWORD PTR [ecx+edi]
0271b f3 0f 6f 74 3b 10 movdqu xmm6, XMMWORD PTR [16+ebx+edi]
;;; xmm4 = xmmloadu(p+2*16);
02721 f3 0f 6f 6c 39 20 movdqu xmm5, XMMWORD PTR [32+ecx+edi]
;;; xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
;;; #if defined(XMM_KAZE_SSE2)
;;; hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
02727 66 0f 6f cf movdqa xmm1, xmm7
0272b 66 0f 73 f9 05 pslldq xmm1, 5
02730 66 0f 73 df 7b psrldq xmm7, 123
02735 66 0f eb cf por xmm1, xmm7
02739 f3 0f 6f 3c 3b movdqu xmm7, XMMWORD PTR [ebx+edi]
0273e 66 0f ef cf pxor xmm1, xmm7
02742 66 0f ef d1 pxor xmm2, xmm1
;;; hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
02746 66 0f 6f ce movdqa xmm1, xmm6
0274a 66 0f 73 f9 05 pslldq xmm1, 5
0274f 66 0f 73 de 7b psrldq xmm6, 123
02754 66 0f eb ce por xmm1, xmm6
02758 f3 0f 6f 74 39 10 movdqu xmm6, XMMWORD PTR [16+ecx+edi]
0275e 66 0f d5 d0 pmullw xmm2, xmm0
02762 66 0f ef ce pxor xmm1, xmm6
;;; hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
02766 66 0f 6f f5 movdqa xmm6, xmm5
0276a 66 0f 73 fe 05 pslldq xmm6, 5
0276f 66 0f ef d9 pxor xmm3, xmm1
02773 66 0f 73 dd 7b psrldq xmm5, 123
02778 66 0f eb f5 por xmm6, xmm5
0277c f3 0f 6f 6c 3b 20 movdqu xmm5, XMMWORD PTR [32+ebx+edi]
02782 66 0f d5 d8 pmullw xmm3, xmm0
02786 66 0f ef f5 pxor xmm6, xmm5
0278a 66 0f ef e6 pxor xmm4, xmm6
0278e 66 0f d5 e0 pmullw xmm4, xmm0
02792 0f 82 75 ff ff ff jb .B4.4

!!! The main (SSE4.1 i.e. 'pmulld' used) loop: !!!
.B4.4:
0270d 8d 3c 76 lea edi, DWORD PTR [esi+esi*2]
02710 46 inc esi
02711 c1 e7 04 shl edi, 4
02714 3b f2 cmp esi, edx
02716 f3 0f 6f 3c 39 movdqu xmm7, XMMWORD PTR [ecx+edi]
0271b f3 0f 6f 74 3b 10 movdqu xmm6, XMMWORD PTR [16+ebx+edi]
;;; xmm4 = xmmloadu(p+2*16);
02721 f3 0f 6f 6c 39 20 movdqu xmm5, XMMWORD PTR [32+ecx+edi]
;;; xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
;;; #if defined(XMM_KAZE_SSE2)
;;; hash32xmm = _mm_mullo_epi16(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
;;; hash32Bxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
;;; hash32Cxmm = _mm_mullo_epi16(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
;;; #else
;;; hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
02727 66 0f 6f cf movdqa xmm1, xmm7
0272b 66 0f 73 f9 05 pslldq xmm1, 5
02730 66 0f 73 df 7b psrldq xmm7, 123
02735 66 0f eb cf por xmm1, xmm7
02739 f3 0f 6f 3c 3b movdqu xmm7, XMMWORD PTR [ebx+edi]
0273e 66 0f ef cf pxor xmm1, xmm7
02742 66 0f ef d1 pxor xmm2, xmm1
;;; hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
02746 66 0f 6f ce movdqa xmm1, xmm6
0274a 66 0f 73 f9 05 pslldq xmm1, 5
0274f 66 0f 73 de 7b psrldq xmm6, 123
02754 66 0f eb ce por xmm1, xmm6
02758 f3 0f 6f 74 39 10 movdqu xmm6, XMMWORD PTR [16+ecx+edi]
0275e 66 0f ef ce pxor xmm1, xmm6
;;; hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
02762 66 0f 6f f5 movdqa xmm6, xmm5
02766 66 0f 73 fe 05 pslldq xmm6, 5
0276b 66 0f ef d9 pxor xmm3, xmm1
0276f 66 0f 73 dd 7b psrldq xmm5, 123
02774 66 0f eb f5 por xmm6, xmm5
02778 f3 0f 6f 6c 3b 20 movdqu xmm5, XMMWORD PTR [32+ebx+edi]
0277e 66 0f ef f5 pxor xmm6, xmm5
02782 66 0f ef e6 pxor xmm4, xmm6
02786 66 0f 38 40 d0 pmulld xmm2, xmm0
0278b 66 0f 38 40 d8 pmulld xmm3, xmm0
02790 66 0f 38 40 e0 pmulld xmm4, xmm0
02795 0f 82 72 ff ff ff jb .B4.4
*/

// emmintrin.h
// Principal header file for Intel(R) Pentium(R) 4 processor SSE2 intrinsics
//extern __m128i __ICL_INTRINCC _mm_mulhi_epi16(__m128i, __m128i);
//extern __m128i __ICL_INTRINCC _mm_mulhi_epu16(__m128i, __m128i);
//extern __m128i __ICL_INTRINCC _mm_mullo_epi16(__m128i, __m128i);
//extern __m128i __ICL_INTRINCC _mm_mul_epu32(__m128i, __m128i);
//
// smmintrin.h
// SSE4.1 intrinsics
// Packed integer 32-bit multiplication with truncation of upper halves of results
//extern __m128i __ICL_INTRINCC _mm_mullo_epi32(__m128i, __m128i);
// Packed integer 32-bit multiplication of 2 pairs of operands producing two 64-bit results
//extern __m128i __ICL_INTRINCC _mm_mul_epi32(__m128i, __m128i);
//
// immintrin.h
// Intel(R) AVX compiler intrinsics.
//extern __m256i __cdecl _mm256_mullo_epi32(__m256i, __m256i);
//extern __m256i __cdecl _mm256_mul_epu32(__m256i, __m256i);
//extern __m256i __cdecl _mm256_mul_epi32(__m256i, __m256i);
//...
// __int8 m256i_i8[32];
// __int16 m256i_i16[16];
// __int32 m256i_i32[8];
// __int64 m256i_i64[4];
// unsigned __int8 m256i_u8[32];
// unsigned __int16 m256i_u16[16];
// unsigned __int32 m256i_u32[8];
// unsigned __int64 m256i_u64[4];
//} __m256i;
//
// Move Aligned Packed Integer Values
// **** VMOVDQA ymm1, m256
// **** VMOVDQA m256, ymm1
// Moves 256 bits of packed integer values from the source operand to the destination
//extern __m256i __ICL_INTRINCC _mm256_load_si256(__m256i const *);
//extern void __ICL_INTRINCC _mm256_store_si256(__m256i *, __m256i);
//
// Move Unaligned Packed Integer Values
// **** VMOVDQU ymm1, m256
// **** VMOVDQU m256, ymm1
// Moves 256 bits of packed integer values from the source operand to the destination
//extern __m256i __ICL_INTRINCC _mm256_loadu_si256(__m256i const *);
//extern void __ICL_INTRINCC _mm256_storeu_si256(__m256i *, __m256i);
//
// VMOVDQA sounds to me almost as JAMIROQUAI - the weirdest hat artist with fast lively movements.
// When YMM machine comes to me FNV1A_YUMMY and FNV1A_JAMIROQUAI are gonna walk on the walls and hit the ceiling.
//
/*
hash32xmm = _mm_set1_epi32(1);
printf("%lu,%lu,%lu,%lu\n",PRIMExmm.m128i_u32[0],PRIMExmm.m128i_u32[1],PRIMExmm.m128i_u32[2],PRIMExmm.m128i_u32[3]);
printf("%lu,%lu,%lu,%lu\n",hash32xmm.m128i_u32[0],hash32xmm.m128i_u32[1],hash32xmm.m128i_u32[2],hash32xmm.m128i_u32[3]);
hash32Bxmm = _mm_mul_epu32( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
hash32Bxmm = _mm_mullo_epi16( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
// 709607,709607,709607,709607
// 1,1,1,1
// 709607,0,709607,0
// 54247,54247,54247,54247

hash32xmm = _mm_set1_epi32(3);
printf("%lu,%lu,%lu,%lu\n",PRIMExmm.m128i_u32[0],PRIMExmm.m128i_u32[1],PRIMExmm.m128i_u32[2],PRIMExmm.m128i_u32[3]);
printf("%lu,%lu,%lu,%lu\n",hash32xmm.m128i_u32[0],hash32xmm.m128i_u32[1],hash32xmm.m128i_u32[2],hash32xmm.m128i_u32[3]);
hash32Bxmm = _mm_mul_epu32( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
hash32Bxmm = _mm_mullo_epi16( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
// 709607,709607,709607,709607
// 3,3,3,3
// 2128821,0,2128821,0
// 31669,31669,31669,31669

hash32xmm = _mm_set1_epi32(3000000);
printf("%lu,%lu,%lu,%lu\n",PRIMExmm.m128i_u32[0],PRIMExmm.m128i_u32[1],PRIMExmm.m128i_u32[2],PRIMExmm.m128i_u32[3]);
printf("%lu,%lu,%lu,%lu\n",hash32xmm.m128i_u32[0],hash32xmm.m128i_u32[1],hash32xmm.m128i_u32[2],hash32xmm.m128i_u32[3]);
hash32Bxmm = _mm_mul_epu32( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
hash32Bxmm = _mm_mullo_epi16( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
// 709607,709607,709607,709607
// 3000000,3000000,3000000,3000000
// 2812188480,495,2812188480,495 !!! 495 *(1<<32)+2812188480=2128821000000 = 709607*3000000 !!!
// 29529920,29529920,29529920,29529920

hash32xmm = _mm_set1_epi32(65536);
printf("%lu,%lu,%lu,%lu\n",PRIMExmm.m128i_u32[0],PRIMExmm.m128i_u32[1],PRIMExmm.m128i_u32[2],PRIMExmm.m128i_u32[3]);
printf("%lu,%lu,%lu,%lu\n",hash32xmm.m128i_u32[0],hash32xmm.m128i_u32[1],hash32xmm.m128i_u32[2],hash32xmm.m128i_u32[3]);
hash32Bxmm = _mm_mul_epu32( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
hash32Bxmm = _mm_mullo_epi16( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
// 709607,709607,709607,709607
// 65536,65536,65536,65536
// 3555131392,10,3555131392,10
// 655360,655360,655360,655360

hash32xmm = _mm_set1_epi32(65535);
printf("%lu,%lu,%lu,%lu\n",PRIMExmm.m128i_u32[0],PRIMExmm.m128i_u32[1],PRIMExmm.m128i_u32[2],PRIMExmm.m128i_u32[3]);
printf("%lu,%lu,%lu,%lu\n",hash32xmm.m128i_u32[0],hash32xmm.m128i_u32[1],hash32xmm.m128i_u32[2],hash32xmm.m128i_u32[3]);
hash32Bxmm = _mm_mul_epu32( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
hash32Bxmm = _mm_mullo_epi16( hash32xmm, PRIMExmm);
printf("%lu,%lu,%lu,%lu\n",hash32Bxmm.m128i_u32[0],hash32Bxmm.m128i_u32[1],hash32Bxmm.m128i_u32[2],hash32Bxmm.m128i_u32[3]);
// 709607,709607,709607,709607
// 65535,65535,65535,65535
// 3554421785,10,3554421785,10
// 11289,11289,11289,11289
*/

/*
!!! The results *unaligned* on my T7500 2200MHz as 32bit Intel 12.1 /Ox compile: !!!
Memory pool starting address: 006E0041 ... NOT 64 byte aligned, FAILURE
Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 16521 clocks or 3.967MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 15787 clocks or 4.151MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 15958 clocks or 4.107MB per clock !!! BRUTALICIOUS: 4GB/s as 32bit XMM code and *unaligned data* !!!
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 17738 clocks or 3.695MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 18003 clocks or 3.640MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 19593 clocks or 3.345MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 19235 clocks or 3.407MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71745 clocks or 0.913MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 12277 clocks or 5.338MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 12839 clocks or 5.104MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block); 65536MB hashed in 13370 clocks or 4.902MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 15741 clocks or 4.163MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 15803 clocks or 4.147MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 17581 clocks or 3.728MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 17644 clocks or 3.714MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69888 clocks or 0.938MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 12183 clocks or 5.379MB per clock
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 13479 clocks or 4.862MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (128KB block); 65536MB hashed in 13291 clocks or 4.931MB per clock
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 15959 clocks or 4.107MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 15803 clocks or 4.147MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 17597 clocks or 3.724MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 17675 clocks or 3.708MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 69841 clocks or 0.938MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 11606 clocks or 5.647MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 12106 clocks or 5.414MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (16KB block); 65536MB hashed in 13135 clocks or 4.989MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 15522 clocks or 4.222MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 15803 clocks or 4.147MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 17643 clocks or 3.715MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 17738 clocks or 3.695MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69841 clocks or 0.938MB per clock

!!! The results *aligned* on my T7500 2200MHz as 32bit Intel 12.1 /Ox compile: !!!
Memory pool starting address: 00DF0040 ... 64 byte aligned, OK
Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15132 clocks or 4.331MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 13946 clocks or 4.699MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 13572 clocks or 4.829MB per clock !!! FLASHY-SLASHY: OUTSPEEDS THE INTERLEAVED 8x4 READ !!!
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14399 clocks or 4.551MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15912 clocks or 4.119MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16427 clocks or 3.990MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14555 clocks or 4.503MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71588 clocks or 0.915MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9532 clocks or 6.875MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 9844 clocks or 6.657MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block); 65536MB hashed in 7332 clocks or 8.938MB per clock !!! COMMENTLESS !!!
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 10155 clocks or 6.454MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9766 clocks or 6.711MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10171 clocks or 6.443MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 10717 clocks or 6.115MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69764 clocks or 0.939MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9235 clocks or 7.096MB per clock
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 8876 clocks or 7.384MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (128KB block); 65536MB hashed in 7254 clocks or 9.034MB per clock
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 9360 clocks or 7.002MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9672 clocks or 6.776MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10109 clocks or 6.483MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 9937 clocks or 6.595MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 69888 clocks or 0.938MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7863 clocks or 8.335MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7894 clocks or 8.302MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (16KB block); 65536MB hashed in 6973 clocks or 9.399MB per clock !!! WIGGING-OUT: 894% faster than CRC32_SlicingBy8 !!!
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8892 clocks or 7.370MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 9110 clocks or 7.194MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9657 clocks or 6.786MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9734 clocks or 6.733MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69342 clocks or 0.945MB per clock
*/

/*
!!! The Knight-Tours results: !!!

YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 136,314,880 bytes long file ...
Keys vs Slots ratio: 1:1 or 1,048,576:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,001,048,576; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,386,039; HashUtilization = 063%; Collisions = 0,000,386,039
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,001,048,576; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,367; HashUtilization = 063%; Collisions = 0,000,385,367
CRC32C2_8slice : Keys = 00,000,000,000,001,048,576; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,451; HashUtilization = 063%; Collisions = 0,000,385,451
Physical Lines: 1,048,576
Shortest Line : 128
Longest Line : 128

YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 272,629,760 bytes long file ...
Keys vs Slots ratio: 2:1 or 2,097,152:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,002,097,152; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,013; HASHfreeSLOTS = 0,000,142,022; HashUtilization = 086%; Collisions = 0,001,190,598
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,002,097,152; 000,000,007 x MAXcollisionsAtSomeSlots = 0,000,000,011; HASHfreeSLOTS = 0,000,141,227; HashUtilization = 086%; Collisions = 0,001,189,803
CRC32C2_8slice : Keys = 00,000,000,000,002,097,152; 000,000,008 x MAXcollisionsAtSomeSlots = 0,000,000,011; HASHfreeSLOTS = 0,000,141,267; HashUtilization = 086%; Collisions = 0,001,189,843
Physical Lines: 2,097,152
Shortest Line : 128
Longest Line : 128

YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 408,944,640 bytes long file ...
Keys vs Slots ratio: 3:1 or 3,145,728:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,003,145,728; 000,000,005 x MAXcollisionsAtSomeSlots = 0,000,000,014; HASHfreeSLOTS = 0,000,051,972; HashUtilization = 095%; Collisions = 0,002,149,124
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,003,145,728; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,015; HASHfreeSLOTS = 0,000,051,812; HashUtilization = 095%; Collisions = 0,002,148,964
CRC32C2_8slice : Keys = 00,000,000,000,003,145,728; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,014; HASHfreeSLOTS = 0,000,051,926; HashUtilization = 095%; Collisions = 0,002,149,078
Physical Lines: 3,145,728
Shortest Line : 128
Longest Line : 128

YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 4,362,076,160 bytes long file ...
Keys vs Slots ratio: 32:1 or 33,554,432:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,033,554,432; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,061; HASHfreeSLOTS = 0,000,000,000; HashUtilization = 100%; Collisions = 0,032,505,856
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,033,554,432; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,064; HASHfreeSLOTS = 0,000,000,000; HashUtilization = 100%; Collisions = 0,032,505,856
CRC32C2_8slice : Keys = 00,000,000,000,033,554,432; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,062; HASHfreeSLOTS = 0,000,000,000; HashUtilization = 100%; Collisions = 0,032,505,856
Physical Lines: 33,554,432
Shortest Line : 128
Longest Line : 128
*/

/*
E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>dir "Thus Spake Zarathustra by Friedrich Nietzsche, revision 4.txt"/b>z.lst

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>Leprechaun_BB048hex_32p_32bit_Intel z.lst z.txt 1222333 y
Leprechaun_BBhex (Fast-In-Future Greedy Building-Block-Ripper), subrev. B, BB = 48.
...

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>dir "MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd"/b>m.lst

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>Leprechaun_BB048hex_32p_32bit_Intel m.lst m.txt 1222333 y
Leprechaun_BBhex (Fast-In-Future Greedy Building-Block-Ripper), subrev. B, BB = 48.
...

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>dir ?.txt
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3
04/26/2013 06:30 AM 379,209,236 m.txt
04/26/2013 06:30 AM 50,897,084 z.txt

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>type m.txt
7465726E6572730D0A6D696477657374776172640D0A6D69647769636B65740D0A6D6964776966650D0A6D6964776966
6B0D0A636861636B6C650D0A636861636B6C696E670D0A636861636D610D0A636861636F6E6E650D0A636861636F6E6E
0D0A616E7461676F6E697A61626C650D0A616E7461676F6E697A6174696F6E0D0A616E7461676F6E697A650D0A616E74
706F736572730D0A7472616E73706F7365730D0A7472616E73706F73696E670D0A7472616E73706F736974696F6E0D0A
62696C6974790D0A646973726570757461626C650D0A646973726570757461626C656E6573730D0A6469737265707574
...

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>type z.txt
6F75722074686F7567687473210D0A416E6420696620796F75722074686F75676874732073756363756D622C20796F75
2C206E65772062656C696576657221940D0A9349742069732073616420656E6F7567682C9420616E7377657265642074
6D666F727461626C65206F6E65732C20746861742049542054414B45544820544F20495453454C462C20616E64207769
656C6C206D652C20796520627265746872656E2C206973206E6F742074686520737472616E67657374206F6620616C6C
72656420616E64207661696E20616E6420657374696D61626C652C206173209374686520676F6F6420616E64206A7573
...

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>YoYo_r2.exe z.txt 20
YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 50,897,084 bytes long file ...
Keys vs Slots ratio: 0:1 or 519,358:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,000,519,358; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,007; HASHfreeSLOTS = 0,000,639,021; HashUtilization = 039%; Collisions = 0,000,109,803
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,000,519,358; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,007; HASHfreeSLOTS = 0,000,639,060; HashUtilization = 039%; Collisions = 0,000,109,842
CRC32C2_8slice : Keys = 00,000,000,000,000,519,358; 000,000,003 x MAXcollisionsAtSomeSlots = 0,000,000,007; HASHfreeSLOTS = 0,000,639,002; HashUtilization = 039%; Collisions = 0,000,109,784
Physical Lines: 519,358
Shortest Line : 96
Longest Line : 96

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>YoYo_r2.exe m.txt 20
YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 379,209,236 bytes long file ...
Keys vs Slots ratio: 3:1 or 3,869,482:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,003,869,482; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,015; HASHfreeSLOTS = 0,000,026,167; HashUtilization = 097%; Collisions = 0,002,847,073
FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,003,869,482; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,017; HASHfreeSLOTS = 0,000,026,207; HashUtilization = 097%; Collisions = 0,002,847,113
CRC32C2_8slice : Keys = 00,000,000,000,003,869,482; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,016; HASHfreeSLOTS = 0,000,026,305; HashUtilization = 097%; Collisions = 0,002,847,211
Physical Lines: 3,869,482
Shortest Line : 96
Longest Line : 96

E:\Benchmark_LuckyLight_r1\YoshimitsuTRIADiiXMM_r1\YoYo_r2>
*/



For reference: www.sanmayce.com/Fastest_Hash/index.html#YoshimitsuTRIADii

Update, 2013-Mar-31: Finally I wrote YoYo - a collision/fattest-slot reporter for big files. From the 3-gram 'REAL WORLD' tortures given below, the picture is quite clear: FNV1A_YoshimitsuTRIADii and FNV1A_Yoshimura are excellent for x-gram hashing.

The dump below was made with the YoYo_r1.zip package.

E:\YoYo_r1>dir Volume in drive E is SSD_Sanmayce Volume Serial Number is 9CF6-FEA3 03/31/2013 05:14 AM 42,153,646,707 enwiki-20121201-pages-articles.xml 03/31/2013 05:14 PM 1,395,608,680 enwiki-20121201-pages-articles.xml_03of32_3-grams.txt 03/31/2013 05:14 PM 9,305,584,848 enwiki-20121201-pages-articles.xml_20of32_3-grams.txt 03/31/2013 05:14 AM 143,360 Leprechaun_x-leton_64bit_Intel_03_32p.exe 03/31/2013 05:14 AM 210 RIP_3-grams.bat 03/31/2013 05:14 AM 1,590 Yorikke prompt.lnk 03/31/2013 05:14 AM 56,361 YoYo_r1.c 03/31/2013 05:14 AM 96,256 YoYo_r1_32bit.exe 03/31/2013 05:14 AM 108,032 YoYo_r1_64bit.exe E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_20of32_3-grams.txt 26 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 26 Allocating KEY memory 1024KB ... OK Allocating HASH memory 256MB ... OK Allocating HASH memory 256MB ... OK Allocating HASH memory 256MB ... OK Hashing all the LF ending lines encountered in 9,305,584,848 bytes long file ... Keys vs Slots ratio: 6:1 or 452,823,515:67,108,864 FNV1A_Yoshimura : Keys = 00,000,000,000,452,823,515; 000,000,003 x MAXcollisionsAtSomeSlots = 0,000,000,026; HASHfreeSLOTS = 0,000,078,051; HashUtilization = 099%; Collisions = 0,385,792,702 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,452,823,515; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,025; HASHfreeSLOTS = 0,000,078,963; HashUtilization = 099%; Collisions = 0,385,793,614 CRC32C2_8slice : Keys = 00,000,000,000,452,823,515; 000,000,003 x MAXcollisionsAtSomeSlots = 0,000,000,025; HASHfreeSLOTS = 0,000,079,269; HashUtilization = 099%; Collisions = 0,385,793,920 Physical Lines: 452,823,515 Shortest Line : 9 Longest Line : 41 Performance: 33485.613 bytes per clock E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_20of32_3-grams.txt 27 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. 
Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 27 Allocating KEY memory 1024KB ... OK Allocating HASH memory 512MB ... OK Allocating HASH memory 512MB ... OK Allocating HASH memory 512MB ... OK Hashing all the LF ending lines encountered in 9,305,584,848 bytes long file ... Keys vs Slots ratio: 3:1 or 452,823,515:134,217,728 FNV1A_Yoshimura : Keys = 00,000,000,000,452,823,515; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,019; HASHfreeSLOTS = 0,004,598,493; HashUtilization = 096%; Collisions = 0,323,204,280 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,452,823,515; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,020; HASHfreeSLOTS = 0,004,598,778; HashUtilization = 096%; Collisions = 0,323,204,565 CRC32C2_8slice : Keys = 00,000,000,000,452,823,515; 000,000,005 x MAXcollisionsAtSomeSlots = 0,000,000,018; HASHfreeSLOTS = 0,004,601,800; HashUtilization = 096%; Collisions = 0,323,207,587 Physical Lines: 452,823,515 Shortest Line : 9 Longest Line : 41 Performance: 29954.242 bytes per clock E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_20of32_3-grams.txt 28 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 28 Allocating KEY memory 1024KB ... OK Allocating HASH memory 1024MB ... OK Allocating HASH memory 1024MB ... OK Allocating HASH memory 1024MB ... OK Hashing all the LF ending lines encountered in 9,305,584,848 bytes long file ... 
Keys vs Slots ratio: 1:1 or 452,823,515:268,435,456 FNV1A_Yoshimura : Keys = 00,000,000,000,452,823,515; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,017; HASHfreeSLOTS = 0,049,688,056; HashUtilization = 081%; Collisions = 0,234,076,115 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,452,823,515; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,014; HASHfreeSLOTS = 0,049,692,281; HashUtilization = 081%; Collisions = 0,234,080,340 CRC32C2_8slice : Keys = 00,000,000,000,452,823,515; 000,000,007 x MAXcollisionsAtSomeSlots = 0,000,000,013; HASHfreeSLOTS = 0,049,694,945; HashUtilization = 081%; Collisions = 0,234,083,004 Physical Lines: 452,823,515 Shortest Line : 9 Longest Line : 41 Performance: 27130.143 bytes per clock E:\YoYo_r1>type enwiki-20121201-pages-articles.xml_20of32_3-grams.txt|more z_otoryjski_war volcano_osorno_chile ybgvvcu_bnwm_z support_bridging_or text_sha_mfkzwy rangers_and_auburn sha_frtc_xosccivqrn tictock_studios_which of_parish_founders phonologique_du_loron nascimento_oporto_european mps_superintendent_gregory muluzi_denied_any nir_twolegresult_pfc lawmakers_think_transit more_puffy_stuff l_robert_hamilton k_bjajhv_f krulak_s_ttl lai_meters_require moth_en_route group_oebb_technische further_infanticide_zoology featuring_dj_sak file_self_made flames_attacking_railtrack gauteng_military_museum cornell_capa_ja chuck_which_features bruce_payne_actor castle_guard_being citizen_james_title aronson_s_pictures benjamin_conley_r and_billy_doctrove and_histochemical_staining between_kumagawa_and done_discouraged_most histogram_cells_frequency preserve_redirect_shearson without_providing_casualty political_manoeuvrings_despite reel_detective_larue unseen_asset_that mise_or_mean of_tzotzil_de havins_will_return karateka_xhavit_bajrami locates_the_entity pc_patch_information franz_beckenbauer_dot enjoyed_modest_box et_le_sol dodgers_hit_home dwight_delivers_his film_trade_dts croix_s_direction continued_instruction_more county_winiary_zag 
dir_north_loop comedian_television_personality checkpoints_and_other citizens_federation_of architecture_international_property The process tried to write to a nonexistent pipe. ^C E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_03of32_3-grams.txt 25 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 25 Allocating KEY memory 1024KB ... OK Allocating HASH memory 128MB ... OK Allocating HASH memory 128MB ... OK Allocating HASH memory 128MB ... OK Hashing all the LF ending lines encountered in 1,395,608,680 bytes long file ... Keys vs Slots ratio: 2:1 or 67,916,422:33,554,432 FNV1A_Yoshimura : Keys = 00,000,000,000,067,916,422; 000,000,009 x MAXcollisionsAtSomeSlots = 0,000,000,013; HASHfreeSLOTS = 0,004,432,986; HashUtilization = 086%; Collisions = 0,038,794,976 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,067,916,422; 000,000,011 x MAXcollisionsAtSomeSlots = 0,000,000,013; HASHfreeSLOTS = 0,004,433,879; HashUtilization = 086%; Collisions = 0,038,795,869 CRC32C2_8slice : Keys = 00,000,000,000,067,916,422; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,014; HASHfreeSLOTS = 0,004,433,459; HashUtilization = 086%; Collisions = 0,038,795,449 Physical Lines: 67,916,422 Shortest Line : 9 Longest Line : 41 Performance: 35585.014 bytes per clock E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_03of32_3-grams.txt 26 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 26 Allocating KEY memory 1024KB ... OK Allocating HASH memory 256MB ... OK Allocating HASH memory 256MB ... OK Allocating HASH memory 256MB ... OK Hashing all the LF ending lines encountered in 1,395,608,680 bytes long file ... 
Keys vs Slots ratio: 1:1 or 67,916,422:67,108,864 FNV1A_Yoshimura : Keys = 00,000,000,000,067,916,422; 000,000,008 x MAXcollisionsAtSomeSlots = 0,000,000,010; HASHfreeSLOTS = 0,024,390,554; HashUtilization = 063%; Collisions = 0,025,198,112 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,067,916,422; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,011; HASHfreeSLOTS = 0,024,396,584; HashUtilization = 063%; Collisions = 0,025,204,142 CRC32C2_8slice : Keys = 00,000,000,000,067,916,422; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,011; HASHfreeSLOTS = 0,024,394,376; HashUtilization = 063%; Collisions = 0,025,201,934 Physical Lines: 67,916,422 Shortest Line : 9 Longest Line : 41 Performance: 31138.773 bytes per clock E:\YoYo_r1>YoYo_r1_64bit.exe enwiki-20121201-pages-articles.xml_03of32_3-grams.txt 27 YoYo - [CR]LF lines hasher, r.1 copyleft Kaze. Note: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long. Polynomial(s) used: CRC32C2_8slice: 0x8F6E37A0 HashSizeInBits = 27 Allocating KEY memory 1024KB ... OK Allocating HASH memory 512MB ... OK Allocating HASH memory 512MB ... OK Allocating HASH memory 512MB ... OK Hashing all the LF ending lines encountered in 1,395,608,680 bytes long file ... Keys vs Slots ratio: 0:1 or 67,916,422:134,217,728 FNV1A_Yoshimura : Keys = 00,000,000,000,067,916,422; 000,000,011 x MAXcollisionsAtSomeSlots = 0,000,000,008; HASHfreeSLOTS = 0,080,915,727; HashUtilization = 039%; Collisions = 0,014,614,421 FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,067,916,422; 000,000,014 x MAXcollisionsAtSomeSlots = 0,000,000,008; HASHfreeSLOTS = 0,080,922,174; HashUtilization = 039%; Collisions = 0,014,620,868 CRC32C2_8slice : Keys = 00,000,000,000,067,916,422; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,080,922,279; HashUtilization = 039%; Collisions = 0,014,620,973 Physical Lines: 67,916,422 Shortest Line : 9 Longest Line : 41 Performance: 27233.514 bytes per clock E:\YoYo_r1>
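The three figures YoYo prints per hasher appear to be tied together by simple arithmetic. This is read off the numbers in the dump, not taken from YoYo's source, and the struct and function names below are hypothetical: Collisions looks like Keys minus the occupied slots, and HashUtilization like the occupied share of the table, truncated to a whole percent.

```c
#include <stdint.h>

/* Hypothetical names; only the arithmetic is checked against the dump. */
typedef struct {
    uint64_t occupied_slots;   /* slots holding at least one key */
    uint64_t collisions;       /* keys that landed in an already occupied slot */
    unsigned utilization_pct;  /* occupied share of the table, truncated */
} SlotStats;

SlotStats slot_stats(uint64_t keys, uint64_t slots, uint64_t free_slots)
{
    SlotStats s;
    s.occupied_slots  = slots - free_slots;
    s.collisions      = keys - s.occupied_slots;
    s.utilization_pct = (unsigned)(s.occupied_slots * 100 / slots);
    return s;
}
```

Feeding it the FNV1A_Yoshimura row of the last run (67,916,422 keys, 134,217,728 slots, 80,915,727 free slots) gives back the printed 14,614,421 collisions and 39% utilization.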



Update, 2013-Mar-26: Just wanted to see how the I&I (Interleaved&Interlaced) YoshimitsuTRIAD behaves; along with this, a new, more versatile 'Lucky Light' benchmark came out, featuring XMM (plus 8DWORDS interleaved) memory reads. Speaking only of linear speed, FNV1A_YoshimitsuTRIADii outslashed all the 32bit hashers known to me (including FNV1A_Yoshimura).

Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):

Linear speed on Sanmayce's 'Bonboniera' laptop (Core 2 T7500, 2200MHz, RAM bus: 666MHz Dual Channel):

BURST_Read_4XMM128bit: (64MB block); 524288MB fetched in 107765 clocks or 4.865MB per clock
BURST_Read_8XMM128bit: (64MB block); 524288MB fetched in 113085 clocks or 4.636MB per clock
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15148 clocks or 4.326MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 14087 clocks or 4.652MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14539 clocks or 4.508MB per clock !!! Fastest of all 4 Yo* !!!
CRC32_SlicingBy8: (64MB block); 65536MB hashed in 71557 clocks or 0.916MB per clock

BURST_Read_4XMM128bit: (128KB block); 524288MB fetched in 37580 clocks or 13.951MB per clock
BURST_Read_8XMM128bit: (128KB block); 524288MB fetched in 36380 clocks or 14.411MB per clock
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9220 clocks or 7.108MB per clock ??? 128KB block size suggests some setback for Interleaved approach ???
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 10577 clocks or 6.196MB per clock !!! Both Reading&Hashing suffer, thanks to Fantasy I saw that on i7 this drawback is no more !!!
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 10686 clocks or 6.133MB per clock !!! Both Reading&Hashing suffer, thanks to Fantasy I saw that on i7 this drawback is no more !!!
CRC32_SlicingBy8: (128KB block); 65536MB hashed in 69795 clocks or 0.939MB per clock

BURST_Read_4XMM128bit: (16KB block); 524288MB fetched in 20436 clocks or 25.655MB per clock
BURST_Read_8XMM128bit: (16KB block); 524288MB fetched in 17456 clocks or 30.035MB per clock
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7909 clocks or 8.286MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8923 clocks or 7.345MB per clock !!! Fastest of all 4 Yo* !!!
CRC32_SlicingBy8: (16KB block); 65536MB hashed in 69732 clocks or 0.940MB per clock

// 'BURST_Read_4XMM128bit' Main Loop:
.B1.182:
009d0 8b f1 mov esi, ecx
009d2 41 inc ecx
009d3 c1 e6 06 shl esi, 6
009d6 81 f9 00 00 80 00 cmp ecx, 8388608
009dc 66 0f 6f 04 32 movdqa xmm0, XMMWORD PTR [edx+esi]
009e1 66 0f 7f 05 00 00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
009e9 66 0f 6f 4c 32 10 movdqa xmm1, XMMWORD PTR [16+edx+esi]
009ef 66 0f 7f 0d 00 00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
009f7 66 0f 6f 54 32 20 movdqa xmm2, XMMWORD PTR [32+edx+esi]
009fd 66 0f 7f 15 00 00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
00a05 66 0f 6f 5c 32 30 movdqa xmm3, XMMWORD PTR [48+edx+esi]
00a0b 66 0f 7f 1d 00 00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
00a13 72 bb jb .B1.182

// 'BURST_Read_8XMM128bit' Main Loop:
.B1.194:
00b00 8b d9 mov ebx, ecx
00b02 41 inc ecx
00b03 c1 e3 07 shl ebx, 7
00b06 81 f9 00 00 40 00 cmp ecx, 4194304
00b0c 66 0f 6f 04 1a movdqa xmm0, XMMWORD PTR [edx+ebx]
00b11 66 0f 7f 05 00 00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
00b19 66 0f 6f 4c 1a 10 movdqa xmm1, XMMWORD PTR [16+edx+ebx]
00b1f 66 0f 7f 0d 00 00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
00b27 66 0f 6f 54 1a 20 movdqa xmm2, XMMWORD PTR [32+edx+ebx]
00b2d 66 0f 7f 15 00 00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
00b35 66 0f 6f 5c 1a 30 movdqa xmm3, XMMWORD PTR [48+edx+ebx]
00b3b 66 0f 7f 1d 00 00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
;;; xmm4 = xmmload(pointerflush+(KT+4)*16);
00b43 66 0f 6f 64 1a 40 movdqa xmm4, XMMWORD PTR [64+edx+ebx]
00b49 66 0f 7f 25 00 00 00 00 movdqa XMMWORD PTR [_xmm4], xmm4
;;; xmm5 = xmmload(pointerflush+(KT+5)*16);
00b51 66 0f 6f 6c 1a 50 movdqa xmm5, XMMWORD PTR [80+edx+ebx]
00b57 66 0f 7f 2d 00 00 00 00 movdqa XMMWORD PTR [_xmm5], xmm5
;;; xmm6 = xmmload(pointerflush+(KT+6)*16);
00b5f 66 0f 6f 74 1a 60 movdqa xmm6, XMMWORD PTR [96+edx+ebx]
00b65 66 0f 7f 35 00 00 00 00 movdqa XMMWORD PTR [_xmm6], xmm6
;;; xmm7 = xmmload(pointerflush+(KT+7)*16);
00b6d 66 0f 6f 7c 1a 70 movdqa xmm7, XMMWORD PTR [112+edx+ebx]
00b73 66 0f 7f 3d 00 00 00 00 movdqa XMMWORD PTR [_xmm7], xmm7
00b7b 72 83 jb .B1.194

// 'BURST_Read_4DWORDS' Main Loop:
.B2.3:
02e4c 83 c3 f0 add ebx, -16
;;; hash32 = *(uint32_t *)(p+0);
02e4f 8b 07 mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4);
02e51 8b 77 04 mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8);
02e54 8b 57 08 mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12);
02e57 8b 4f 0c mov ecx, DWORD PTR [12+edi]
02e5a 83 c7 10 add edi, 16
02e5d 83 fb 10 cmp ebx, 16
02e60 73 ea jae .B2.3

// 'BURST_Read_8DWORDSi' Main Loop:
.B3.3:
;;; for(; Loop_Counter; Loop_Counter--, p += 4*sizeof(uint32_t)) {
;;; hash32 = *(uint32_t *)(p+0) ^ *(uint32_t *)(p+0+Second_Line_Offset);
02ebc 8b 07 mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4) ^ *(uint32_t *)(p+4+Second_Line_Offset);
02ebe 8b 77 04 mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8) ^ *(uint32_t *)(p+8+Second_Line_Offset);
02ec1 8b 57 08 mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12) ^ *(uint32_t *)(p+12+Second_Line_Offset);
02ec4 8b 4f 0c mov ecx, DWORD PTR [12+edi]
02ec7 33 04 1f xor eax, DWORD PTR [edi+ebx]
02eca 33 74 1f 04 xor esi, DWORD PTR [4+edi+ebx]
02ece 33 54 1f 08 xor edx, DWORD PTR [8+edi+ebx]
02ed2 33 4c 1f 0c xor ecx, DWORD PTR [12+edi+ebx]
02ed6 83 c7 10 add edi, 16
02ed9 4d dec ebp
02eda 75 e0 jne .B3.3

// 'FNV1A_YoshimitsuTRIADii' Main Loop:
.B4.4:
;;; hash32 = (hash32 ^ (_rotl_KAZE(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
02f4c 8b 2b mov ebp, DWORD PTR [ebx]
02f4e c1 c5 05 rol ebp, 5
02f51 33 2c 33 xor ebp, DWORD PTR [ebx+esi]
02f54 33 fd xor edi, ebp
;;; hash32B = (hash32B ^ (_rotl_KAZE(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
02f56 8b 6c 33 04 mov ebp, DWORD PTR [4+ebx+esi]
02f5a c1 c5 05 rol ebp, 5
02f5d 33 6b 04 xor ebp, DWORD PTR [4+ebx]
02f60 33 cd xor ecx, ebp
;;; hash32C = (hash32C ^ (_rotl_KAZE(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
02f62 8b 6b 08 mov ebp, DWORD PTR [8+ebx]
02f65 c1 c5 05 rol ebp, 5
02f68 33 6c 33 08 xor ebp, DWORD PTR [8+ebx+esi]
02f6c 83 c3 0c add ebx, 12
02f6f 33 c5 xor eax, ebp
02f71 69 ff e7 d3 0a 00 imul edi, edi, 709607
02f77 69 c9 e7 d3 0a 00 imul ecx, ecx, 709607
02f7d 69 c0 e7 d3 0a 00 imul eax, eax, 709607
02f83 4a dec edx
02f84 75 c6 jne .B4.4

; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FAcs";

And the full dump:



Copying a 256MB block 1024 times i.e. 256GB READ + 256GB WRITTEN ...
memcpy(): (256MB block); 262144MB copied in 141321 clocks or 1.855MB per clock

Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (512MB block); 524288MB fetched in 107812 clocks or 4.863MB per clock
Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (512MB block); 524288MB fetched in 113162 clocks or 4.633MB per clock
Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (64MB block); 524288MB fetched in 107765 clocks or 4.865MB per clock
Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (64MB block); 524288MB fetched in 113085 clocks or 4.636MB per clock
Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (128KB block); 524288MB fetched in 37580 clocks or 13.951MB per clock
Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (128KB block); 524288MB fetched in 36380 clocks or 14.411MB per clock
Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (16KB block); 524288MB fetched in 20436 clocks or 25.655MB per clock
Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (16KB block); 524288MB fetched in 17456 clocks or 30.035MB per clock

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15148 clocks or 4.326MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 14087 clocks or 4.652MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14539 clocks or 4.508MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15912 clocks or 4.119MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16427 clocks or 3.990MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14695 clocks or 4.460MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71557 clocks or 0.916MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9532 clocks or 6.875MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 8907 clocks or 7.358MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 9407 clocks or 6.967MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10187 clocks or 6.433MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 9984 clocks or 6.564MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69763 clocks or 0.939MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9220 clocks or 7.108MB per clock
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 10577 clocks or 6.196MB per clock
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 10686 clocks or 6.133MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9656 clocks or 6.787MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10094 clocks or 6.493MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 11278 clocks or 5.811MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 69795 clocks or 0.939MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7909 clocks or 8.286MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8923 clocks or 7.345MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 8955 clocks or 7.318MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9719 clocks or 6.743MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9734 clocks or 6.733MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69732 clocks or 0.940MB per clock

Below, the results after running 64bit code:



E:\Benchmark_LuckyLight_r1>benchmark_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-YoshimitsuTRIADii fnv1a-tesla fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 108 ms, 209715202 bytes = 1851 MB/s
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
CityHash128              1.0.3             209715218  (x 1.000)  3389 MB/s  3389 MB/s  277e15  277e15
CityHash64               1.0.3             209715210  (x 1.000)  3389 MB/s  3448 MB/s  277e15  282e15
SpookyHash               2012-03-30        209715218  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
fnv1a-jesteress          v2                209715206  (x 1.000)  3333 MB/s  3333 MB/s  273e15  273e15
fnv1a-yoshimura          v2                209715206  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
fnv1a-YoshimitsuTRIADii  v2                209715206  (x 1.000)  4347 MB/s  4347 MB/s  356e15  356e15
fnv1a-tesla              v2                209715210  (x 1.000)  4651 MB/s  4651 MB/s  381e15  381e15
fnv1a-tesla3             v2                209715210  (x 1.000)  4347 MB/s  4347 MB/s  356e15  356e15
xxhash-fast              r3                209715206  (x 1.000)  4081 MB/s  4081 MB/s  334e15  334e15
xxhash-strong            r3                209715206  (x 1.000)  2816 MB/s  2816 MB/s  230e15  230e15
xxhash256                r3                209715234  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
done... (77x1 iteration(s)).

E:\Benchmark_LuckyLight_r1>benchmark_Intel_12.1_O3.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-YoshimitsuTRIADii fnv1a-tesla fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 108 ms, 209715202 bytes = 1851 MB/s
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
CityHash128              1.0.3             209715218  (x 1.000)  3389 MB/s  3278 MB/s  277e15  268e15
CityHash64               1.0.3             209715210  (x 1.000)  3389 MB/s  3448 MB/s  277e15  282e15
SpookyHash               2012-03-30        209715218  (x 1.000)  3703 MB/s  3703 MB/s  303e15  303e15
fnv1a-jesteress          v2                209715206  (x 1.000)  3389 MB/s  3389 MB/s  277e15  277e15
fnv1a-yoshimura          v2                209715206  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
fnv1a-YoshimitsuTRIADii  v2                209715206  (x 1.000)  3846 MB/s  3846 MB/s  315e15  315e15
fnv1a-tesla              v2                209715210  (x 1.000)  4651 MB/s  4651 MB/s  381e15  381e15
fnv1a-tesla3             v2                209715210  (x 1.000)  4347 MB/s  4347 MB/s  356e15  356e15
xxhash-fast              r3                209715206  (x 1.000)  4081 MB/s  4081 MB/s  334e15  334e15
xxhash-strong            r3                209715206  (x 1.000)  2816 MB/s  2816 MB/s  230e15  230e15
xxhash256                r3                209715234  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
done... (77x1 iteration(s)).

E:\Benchmark_LuckyLight_r1>benchmark_Intel_12.1_fast.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-YoshimitsuTRIADii fnv1a-tesla fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 109 ms, 209715202 bytes = 1834 MB/s
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
CityHash128              1.0.3             209715218  (x 1.000)  2380 MB/s  2380 MB/s  195e15  195e15
CityHash64               1.0.3             209715210  (x 1.000)  2127 MB/s  2173 MB/s  174e15  178e15
SpookyHash               2012-03-30        209715218  (x 1.000)  3703 MB/s  3703 MB/s  303e15  303e15
fnv1a-jesteress          v2                209715206  (x 1.000)  3389 MB/s  3389 MB/s  277e15  277e15
fnv1a-yoshimura          v2                209715206  (x 1.000)  4081 MB/s  4081 MB/s  334e15  334e15
fnv1a-YoshimitsuTRIADii  v2                209715206  (x 1.000)  3846 MB/s  3846 MB/s  315e15  315e15
fnv1a-tesla              v2                209715210  (x 1.000)  4651 MB/s  4651 MB/s  381e15  381e15
fnv1a-tesla3             v2                209715210  (x 1.000)  4444 MB/s  4444 MB/s  364e15  364e15
xxhash-fast              r3                209715206  (x 1.000)  4081 MB/s  4081 MB/s  334e15  334e15
xxhash-strong            r3                209715206  (x 1.000)  2816 MB/s  2816 MB/s  230e15  230e15
xxhash256                r3                209715234  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
done... (77x1 iteration(s)).

E:\Benchmark_LuckyLight_r1>benchmark_Microsoft_VS2010_Ox.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-YoshimitsuTRIADii fnv1a-tesla fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 114 ms, 209715202 bytes = 1754 MB/s
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
CityHash128              1.0.3             209715218  (x 1.000)  4444 MB/s  4444 MB/s  364e15  364e15
CityHash64               1.0.3             209715210  (x 1.000)  4255 MB/s  4255 MB/s  348e15  348e15
SpookyHash               2012-03-30        209715218  (x 1.000)  4081 MB/s  4081 MB/s  334e15  334e15
fnv1a-jesteress          v2                209715206  (x 1.000)  3333 MB/s  3333 MB/s  273e15  273e15
fnv1a-yoshimura          v2                209715206  (x 1.000)  4166 MB/s  4166 MB/s  341e15  341e15
fnv1a-YoshimitsuTRIADii  v2                209715206  (x 1.000)  4255 MB/s  4255 MB/s  348e15  348e15
fnv1a-tesla              v2                209715210  (x 1.000)  4651 MB/s  4651 MB/s  381e15  381e15
fnv1a-tesla3             v2                209715210  (x 1.000)  4347 MB/s  4347 MB/s  356e15  356e15
xxhash-fast              r3                209715206  (x 1.000)  4255 MB/s  4255 MB/s  348e15  348e15
xxhash-strong            r3                209715206  (x 1.000)  2857 MB/s  2857 MB/s  234e15  234e15
xxhash256                r3                209715234  (x 1.000)  4255 MB/s  4255 MB/s  348e15  348e15
Codec                    version     args  C.Size     (C.Ratio)  C.Speed    D.Speed    C.Eff.  D.Eff.
done... (77x1 iteration(s)).





For reference:

My one-page-manga-like swagger







Update, 2013-Mar-14: A remainderless variant for keys of 16 or more bytes appeared while I was playing with Yorikke and trying out an old idea of mine: to reduce the branching.



Also, through the tests I have made, a simple maxim popped up: '3 is better than 2', that is, the 3 YoshimitsuTRIAD hash lines disperse better than the 2 Yorikke hash lines.



Thanks to the guys from overclock.net, I gathered some AMD stats.

Here it is interesting to compare the two (Intel) outputs and the two (AMD) outputs obtained so far:



Core 2 T7500 2200MHz = 11x200, CPU FSB speed: 4x200MHz, RAM bus: 333MHz (DDR2, Dual Channel), L1 Data Cache: 32KB:

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 9048 clocks or 7.243MB per clock

FNV1A_Yorikke: (16KB block); 65536MB hashed in 9626 clocks or 6.808MB per clock

CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69779 clocks or 0.939MB per clock



i7-3930K, 4500MHz, CPU bus: 125MHz, RAM bus: 1200MHz (DDR3, Quad Channel), L1 Data Cache: 32KBytes:

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 4390 clocks or 14.928MB per clock

FNV1A_Yorikke: (16KB block); 65536MB hashed in 5123 clocks or 12.793MB per clock

CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 36210 clocks or 1.810MB per clock



AMD FX-8120, 4515MHz = 21x215, L1 Data Cache: 8x16KB:

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 5859 clocks or 11.186MB per clock

FNV1A_Yorikke: (16KB block); 65536MB hashed in 6407 clocks or 10.229MB per clock

CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 44437 clocks or 1.475MB per clock



AMD Phenom II X6 1600T, 4000MHz = 16x250, FSB Frequency: 250MHz, DRAM Frequency: 1000MHz, Dual DDR3, L1 Data Cache: 6x64KB:

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 5769 clocks or 11.360MB per clock

FNV1A_Yorikke: (16KB block); 65536MB hashed in 7555 clocks or 8.675MB per clock

CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 34650 clocks or 1.891MB per clock



In my view, 'Zambezi' (32nm) is a step back compared to 'Thuban' (45nm); instead of a jump, AMD did a somersault.

Looking at the CRC32_SlicingBy8 function, though, AMD did a great job (I believe the 'Barton' legacy speaks within 'Thuban'), outperforming the fastest Intel system I have seen thus far: 1.891MB per clock vs 1.810MB per clock.

I wanted to post results obtained on my old computer (with AMD Barton 1920MHz) but the fans were dead.



I wonder how a XEON (Ivy-Bridge) and the latest AMD (with no reduced L1 cache) would fare.



Up to now:

On AMD (Phenom II X6 1600T, 4000MHz) FNV1A_YoshimitsuTRIAD reigns with 11.360MB per clock or 11360/1024 = 11.093GB/s.

On Intel (i7-3930K, 4500MHz) FNV1A_YoshimitsuTRIAD reigns with 14.928MB per clock or 14928/1024 = 14.578GB/s.
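The conversion used in the two lines above can be written down as a small sketch. The function names are hypothetical, and the GB/s step assumes that one 'clock' is one millisecond (i.e. clock() ticks with CLOCKS_PER_SEC equal to 1000), which is what the 11360/1024 arithmetic implies.

```c
/* MB processed per clock() tick, as printed by the benchmark. */
double mb_per_clock(double megabytes, double clocks)
{
    return megabytes / clocks;
}

/* ASSUMPTION: one clock() tick is one millisecond (CLOCKS_PER_SEC == 1000),
   so MB per clock times 1000 is MB/s; dividing by 1024 turns that into GB/s. */
double gb_per_second(double mb_per_tick)
{
    return mb_per_tick * 1000.0 / 1024.0;
}
```

For instance, mb_per_clock(65536, 5769) is about 11.360, and gb_per_second(11.360) is 11.09375, which the text above shows cut to 11.093.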



It is worth the attempt to explore the Jesteress-Yorikke (i.e. 1 hash line 4+4 vs 2 hash lines 4+4) 8-16 GAP in order to lessen their collisions even further.

Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.

The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.

Idea #1 is to exterminate all the nasty IFs outside the main loop; I believe such a branchless etude will outperform Jesteress.

Idea #2 is to STRESS memory by fetching not-so-adjacent areas.
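To make idea #1 concrete, here is a minimal single-line sketch of the overlapping trick, not any of the actual FNV1A_* functions. The name overlap_tail_hash is hypothetical, the multiplier 709607 is the one visible in the FNV1A_YoshimitsuTRIADii listing, 2166136261 is the standard FNV-1a 32-bit offset basis, and keys are assumed to be at least 4 bytes long. Instead of mixing a 1/2/3-byte remainder through IFs, one extra 4-byte word is read so that it ends at the last byte of the key.

```c
#include <stdint.h>
#include <string.h>

/* 4-byte load that is safe for unaligned addresses. */
static uint32_t load32(const char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Illustrative only: requires len >= 4. */
uint32_t overlap_tail_hash(const char *key, uint32_t len)
{
    const uint32_t PRIME = 709607;
    uint32_t hash32 = 2166136261u;
    uint32_t i;

    /* Main loop: whole 4-byte words from the left. */
    for (i = 0; i + 4 <= len; i += 4)
        hash32 = (hash32 ^ load32(key + i)) * PRIME;

    /* No remainder IFs: always mix the word that ends at the last byte.
       It overlaps bytes already hashed (or repeats the last word when
       len is a multiple of 4). */
    hash32 = (hash32 ^ load32(key + len - 4)) * PRIME;
    return hash32;
}
```

The price is that some bytes are hashed twice; the gain is that the code path no longer depends on len % 4.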



For example:

Key: hash_with_overlapping_aye_aye

Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'

Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'

Key_Length: 29

Loop_Counter: 3 //if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);



Loop #1 of 3:

Hash line 1: hash

Hash line 2: h_ov

Hash line 3: ping

Loop #2 of 3:

Hash line 1: _wit

Hash line 2: erla

Hash line 3: _aye

Loop #3 of 3:

Hash line 1: h_ov

Hash line 2: ppin

Hash line 3: _aye
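The slicing in the example above can be reproduced with a short sketch. The function names are hypothetical. Line 1 starts at the front of the key and line 3 is placed so that its last quadruplet ends at the last byte; the start of line 2 is NOT stated in the text, so the sketch assumes the midpoint between the other two starts, which happens to yield the offsets of the example (0, 8 and 17 for the 29-byte key). Keys are assumed to be at least 4 bytes long.

```c
#include <string.h>

/* Loop count from the text: Key_Length / 12, rounded up. */
unsigned loop_counter(unsigned key_length)
{
    return key_length % (3 * 4) ? key_length / (3 * 4) + 1
                                : key_length / (3 * 4);
}

/* Start offset of hash line 0, 1 or 2 (hash lines 1, 2, 3 in the text).
   ASSUMPTION: the middle line starts halfway between the first and the last. */
unsigned line_start(unsigned line, unsigned key_length)
{
    unsigned last = key_length - 4 * loop_counter(key_length);
    if (line == 0) return 0;
    if (line == 1) return last / 2;
    return last;
}

/* Copy the quadruplet that `line` reads in iteration `loop` (both 0-based). */
void quadruplet(const char *key, unsigned key_length,
                unsigned line, unsigned loop, char out[5])
{
    memcpy(out, key + line_start(line, key_length) + 4 * loop, 4);
    out[4] = '\0';
}
```

With the key hash_with_overlapping_aye_aye, iteration 0 yields 'hash', 'h_ov', 'ping' and iteration 2 yields 'h_ov', 'ppin', '_aye', as listed above. For a key whose length is a multiple of 12 the three lines under this assumption simply cover the three thirds of the key without overlap.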



I don't know the internals; whether the lines are 32/64/128 bytes long is a secondary concern.

Well, the key is too short, in reality the a