A new kernel for SHA-3 (Keccak) cryptographic hashing has been committed to the DWScript repository.

It is almost 3 times faster than the Pascal version, makes use of MMX asm, and involved an “ad hoc compiler”.



Keccak was the winner of the NIST hashing competition for a new hashing algorithm to provide an alternative to SHA-256. It became SHA-3 and was named by NIST a FIPS 202 hashing standard in 2015.

The new kernel bring SHA-3 hashing performance in 32bits to a similar level as SHA-2 (using Arnaud Bouchez’s implementation as reference), so there is no longer any performance penalty in choosing SHA-3 over SHA-2.

It is also faster and more compact than the MMX kernel previously posted on Wolfgang Ehrhardt’s page, through reduced memory traffic and improved register usage.

A 64bit asm version is also provided, it is faster than what the Delphi compiler generates, but cannot yet be considered as optimized (it will eventually get mashed through the ad hoc compiler as well).

The implementation can be found in dwsSHA3.pas in Libraries/CryptoLib path of the DWScript repository.

Ad hoc compiler

The kernel was generated by a very specific compiler, which “parsed” Pascal lines of code, “assembled” them, before applying some greedy optimizations. Since Keccak consists from internal functions which are order-independent, an optimizer would then shuffle the Pascal code, and run them again through the compiler, with the aim of minimizing memory accesses. Rinse and repeat.

By ad hoc parser, I mean “something” that takes lines like (exactly like)

A[00] := B[00] xor ((not B[01]) and B[02]);

A[01] := B[01] xor ((not B[02]) and B[03]);

A[02] := B[02] xor ((not B[03]) and B[04]);

A[03] := B[03] xor ((not B[04]) and B[00]);

A[04] := B[04] xor ((not B[00]) and B[01]);

A[05] := B[05] xor ((not B[06]) and B[07]); etc.

and processes them using code like

procedure ProcessChi(line : String; code : array of String);

begin

var i := line.After(‘A[‘).Before(‘]’).ToInteger;

var j := line.After(‘B[‘).Before(‘]’).ToInteger;

var k := line.After(‘not B[‘).Before(‘]’).ToInteger;

var l := line.After(‘and B[‘).Before(‘]’).ToInteger;

var x := (i and 7).ToString; code.Add(‘ movq mm’ + x + ‘, [eax’ + Off(k*8-128) + ‘]’);

code.Add(‘ pandn mm’ + x + ‘, [eax’ + Off(l*8-128) + ‘]’);

code.Add(‘ pxor mm’ + x + ‘, [eax’ + Off(j*8-128) + ‘]’);

code.Add(‘ movq [edx’ + Off(i*8-128) + ‘], mm’ + x);

end;

…more ad hoc than that would take a regexp, but regexp are not meant for humans beings.

Ad hoc optimizer

The resulting asm code goes through a “greedy” optimizer, which is just as ad hoc. It applies the following optimizations:

moves memory reads earlier in the code (and stops when it finds the register name on a line, or a memory write, a write being defined as an ‘[‘ after a ‘,’)

eliminates memory reads when the corresponding data is already in a register (search and replace)

eliminates unnecessary memory writes (a write being defined as an ‘[‘ before a ‘,’, and an unnecessary write is when what is between ‘[‘ and ‘]’ can be found only once in the code)

Keccak kernel is comprised of four internal functions, which are regrouped into four code blocks (Theta1, Theta2, RhoPi and Chi). Within those blocks, the order of lines of code can be changed without affecting the outcome.

So the optimizing part of the compiler randomly shuffles those lines within the code blocks, before running them through the above steps. This performs a stochastic gradient descent, and the state is periodically (and randomly) reset, so it share properties with simulated annealing.

In layman terms, it is bashing like a dumb brute at the problem. A bit like a monkey tester.

“Winners” of the above process are found after just a few thousand attempts (mere seconds), one them is the version in sha3_mmx.inc. There are many winners, because the MMX registers are interchangeabled, but interestingly enough they all looked the same. I tried to let the optimizer run for hours, and it found nothing better, so while it is likely not an absolute optimal, it is probably an optimum given the current “compiler” limitations!