[noise] [ANNOUNCE] WireGuard Launched!

Brian Smith writes:
> wouldn't the performance be much better on Skylake+ and ARMv8+?

No. Concretely, several generations of Intel chips have run 12-round ChaCha12-256 at practically the same speed as 12-round AES-192 (with a similar security margin), even though AES-192 has "hardware support", a smaller key, a smaller block size, and smaller data limits. For example:

   * Both ciphers are ~1.7 cycles/byte on Westmere (introduced 2010).
   * Both ciphers are ~1.5 cycles/byte on Ivy Bridge (introduced 2012).
   * Both ciphers are ~0.8 cycles/byte on Skylake (introduced 2015).

ChaCha20-256 is slower than ChaCha12-256, but this is entirely because it has a much larger security margin. For reasons explained below, I wouldn't be surprised to see ChaCha20-256 running _faster_ than AES-256 on future Intel chips.

Anyway, it's hard to find any applications that will notice any of these gaps. We're talking about differences under 1 cycle/byte, on CPUs where cycles are very fast in the first place, vs. multiple cycles/byte for typical data processing. Meanwhile the ecosystem includes many smaller CPUs where AES is painfully slow---e.g., ~20 cycles/byte for AES-128 on a Cortex-A8. On a small CPU core, spending precious area on an AES instruction isn't an easy decision for a CPU designer to make! The ecosystem also includes, more importantly, many AES implementations vulnerable to timing attacks.

> SHA-256 + AES-256-GCM, for which newer CPUs are being optimized

No. CPUs are optimized for video games. Video games are a huge market---a market where people pay close attention to performance and adjust their purchasing decisions accordingly. Yes, yes, some buyers pay attention to CPU performance for weather prediction, or movie rendering, or various other important applications. But most of these applications have the same basic bottlenecks as video games: most importantly, low-precision multiplications and additions.
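A quick sanity check on what these cycles/byte numbers mean in practice: at a given clock rate, per-core throughput is simply clock frequency divided by cycles/byte. A minimal sketch in Python (the 3.5 GHz clock below is an illustrative assumption, not a figure from this post):

```python
def throughput_gbps(cycles_per_byte, clock_ghz=3.5):
    """GB/s implied by a cycles/byte cost at a given clock rate.

    The 3.5 GHz default is an illustrative assumption, not a
    measurement; substitute the clock rate of the CPU in question.
    """
    return clock_ghz / cycles_per_byte

# Skylake-era ~0.8 cycles/byte works out to several GB/s per core,
# far more than typical data processing can consume.
print(throughput_gbps(0.8))
```

Even the ~1.7 cycles/byte Westmere figure is on the order of 2 GB/s per core at these clock rates, which is why differences under 1 cycle/byte are so hard for applications to notice.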
Do CPU designers spend area on niche operations such as _binary-field_ multiplication? Sometimes, yes, but not much area. Given how CPUs are actually used, CPU designers see vastly more benefit to spending area on, e.g., vectorized floating-point multipliers. Intel is doubling the vector size in its newest CPUs---again!---this time from 256 bits to 512 bits.

The design principles explained in

   https://cr.yp.to/snuffle/design.pdf (especially parallelization)
   https://cr.yp.to/snuffle/speed.pdf (see in particular Section 12)

allow Salsa20 and ChaCha20 to take direct advantage of this increase in vector size---and also of other improvements that Intel has been adding, such as extra vector units and three-operand instructions. Intel is also adding vectorized rotations (which take very little extra hardware area on top of shifts), so add-shift-shift-merge-xor becomes add-rotate-xor.

Poly1305 is similarly parallelizable, with slightly more effort. It also benefits from Intel's continued attention to the speed of multipliers. Intel's newest optimization here is the introduction of very efficient 52-bit multipliers that expose the internals of the floating-point multipliers. This is essentially what was requested in, e.g.,

   * https://cr.yp.to/talks/2002.06.15/slides.ps (last slide, reusing the 64-bit extended-precision floating-point multiplier) and
   * https://blog.cr.yp.to/20140517-insns.html (reusing the 53-bit double-precision vector floating-point multiplier, very close to what Intel subsequently announced).

This will---with very little hardware cost---also speed up conservative prime-field ECC and many further applications.

The reasons that CPUs can make simple vector arithmetic run really fast are related, at a fundamental hardware-design level, to the reasons that the same instructions take constant time on most CPUs. The story is quite different for variable data indices: these naturally create variations in timing and are hard to efficiently vectorize.
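The add-rotate-xor pattern mentioned above is exactly ChaCha's inner loop. A minimal sketch of the ChaCha quarter-round in Python (32-bit arithmetic emulated with masking), checked against the quarter-round test vector in RFC 8439:

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    # Rotation costs almost no hardware area on top of a shifter;
    # without a rotate instruction it becomes shift-shift-merge.
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    # Nothing here but 32-bit add, rotate, and xor -- the three
    # operations that map directly onto wide vector units.
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

# RFC 8439 section 2.1.1 test vector:
print([hex(v) for v in quarter_round(
    0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)])
# -> ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

In the full cipher this quarter-round is applied to four independent columns (then four diagonals) at once, and many blocks can be processed in parallel, which is why ChaCha benefits directly from 128-, 256-, and now 512-bit vector units.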
For the same reasons, variable instruction indices (i.e., branches) naturally create variations in timing and are hard to efficiently vectorize.

AES was never (and GCM was never) designed to run quickly in software using common arithmetic instructions: it relied critically on variable data indices for performance. This approach turned out to be a security nightmare, producing many successful timing attacks. Even in contexts where timing attacks weren't a concern, this approach vectorized poorly, with performance falling farther and farther behind pure arithmetic designs as CPUs evolved in predictable ways.

Adding hardware for AES arithmetic, particularly inversion in GF(256), deals with the timing attacks but costs quite a bit of hardware area for yet another niche operation. Intel decided that on its large cores it could justify the area for 8 inverters, and then for 16 inverters, and probably 32 on future cores---but this is only a small percentage of the overall area of Intel's vector units, and the percentage doesn't seem to be increasing as the years go by.

This is why 12-round AES-192 isn't beating 12-round ChaCha12-256. Yes, there's some special-purpose hardware for AES, but there's much more general-purpose hardware for vector computations, and ChaCha takes advantage of this general-purpose hardware.

---Dan