The cr.yp.to blog

2017.07.23: Fast-key-erasure random-number generators

A stream cipher expands a secret key into a long stream of random bytes. The standard security goal for a stream cipher is easy to explain: the attacker can't distinguish the output bytes from independent uniform random bytes. Saying that a random byte is "uniform" means that each of the 2^8 possible bytes appears with probability 1/2^8; "independent" means that the probability of a sequence of bytes is the product of the probabilities of the individual bytes.

There are many ciphers whose secret keys are too short, making it impossible for them to achieve this security goal. But for the moment let's focus on the AES-256-CTR stream cipher, using a uniform random 256-bit key k (probability 1/2^256 of each particular 256-bit string). That's a long enough key: the attacker has a practically nonexistent chance of ever guessing k, even with a future quantum computer running Grover's algorithm. Saying "can't distinguish" doesn't mean zero chance; it means negligible chance for any reasonable cost.

The central question in the AES literature is whether there's any feasible attack with a noticeable probability of distinguishing the 128-bit output blocks AES_k(0), AES_k(1), AES_k(2), ... from a uniform random sequence of distinct blocks. The research details inspire confidence in the security of AES: any successful attack would be a huge breakthrough in cryptanalysis.

Wait a minute: "distinct blocks"? "Distinct" wasn't part of the security goal! Inspecting b independent uniform random 128-bit blocks will find a collision with probability close to b(b−1)/2^129; any collision shows immediately that these are not AES output blocks. As b grows, this becomes an increasingly severe failure of AES-256-CTR to reach the standard security goal.
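To get a feel for how quickly this collision probability grows with b, here is a small Python sketch that evaluates the approximation b(b−1)/2^129 for a few block counts. The particular values of b are arbitrary illustrations chosen for this sketch, and the formula is only the approximation quoted in the text, which stops being meaningful once it approaches 1.

```python
from fractions import Fraction


def collision_estimate(b: int) -> Fraction:
    """Approximate chance that b independent uniform 128-bit blocks collide."""
    return Fraction(b * (b - 1), 2**129)


# Illustrative block counts (assumptions of this sketch, not from the post).
for log2_b in (20, 32, 48, 64):
    p = collision_estimate(2**log2_b)
    print(f"b = 2^{log2_b}: about {float(p):.3e}")
```

Real AES-CTR output never collides under a single key, so an attacker who expects to see a collision among uniform random blocks and sees none has evidence of AES.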

But let's assume for the moment that b is small enough that this isn't a problem. Then it's safe to rely on AES-256-CTR.

Key erasure

The main topic of this blog post is designing a high-security random-number generator (RNG). Sounds like this is solved by AES-256-CTR, right? Presumably, by hashing enough data from non-malicious entropy sources, we can produce a 256-bit key that's completely unpredictable for the attacker, i.e., indistinguishable from a uniform random key. Then we run AES-256-CTR using this key to generate all the randomness that applications need.

But suppose an attacker steals your computer and looks at what's stored in memory. Can the attacker figure out random numbers that were previously generated? Yes: the AES-256-CTR key k was never erased, so the attacker can compute the whole historical sequence of random outputs: AES_k(0), AES_k(1), AES_k(2), ...

This is actively dangerous if you're relying on the "forward secrecy" of short-term-public-key systems. You use the RNG to generate, e.g., a one-minute ECC key or a single-use New Hope key; you receive data encrypted to that key; you erase the secret key and the plaintext (after reading the plaintext); you then expect that you're safe against an attacker who recorded the ciphertext, even if the attacker subsequently steals your computer. But this expectation is sabotaged by the RNG. In academic terminology, this is a failure of "forward security" of the RNG; in NIST terminology, it is a failure of "backtracking resistance".

Fortunately, there's an easy fix:

Starting from the 256-bit key k, generate (say) 48 blocks B_0 = AES_k(0), B_1 = AES_k(1), ..., B_47 = AES_k(47). This is a total of 768 bytes of AES output.

Immediately overwrite the key k with the first two blocks B_0, B_1.

Use the other blocks B_2, ..., B_47 as 736 bytes of RNG output, of course erasing each byte as soon as it is consumed.

Start over with the new key.
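The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production RNG: the function stream is a stand-in that uses SHAKE-256 from the standard library instead of AES-256-CTR, the class and method names are hypothetical, and Python cannot actually guarantee erasure (for example, bytes(self._key) creates an immutable copy that is never wiped). The overwrites below show where erasure happens in the design, not that it is achieved.

```python
import hashlib

KEY_BYTES = 32    # two 16-byte blocks B_0, B_1
OUT_BYTES = 736   # forty-six 16-byte blocks B_2, ..., B_47


def stream(key: bytes, nbytes: int) -> bytes:
    """Stand-in keystream generator (an assumption of this sketch).

    The post uses AES-256-CTR; SHAKE-256 is used here only because it is
    available in the Python standard library.
    """
    return hashlib.shake_256(key).digest(nbytes)


class FastKeyErasureRNG:
    def __init__(self, seed: bytes) -> None:
        if len(seed) != KEY_BYTES:
            raise ValueError("seed must be 32 bytes")
        self._key = bytearray(seed)
        self._buf = bytearray(OUT_BYTES)
        self._pos = OUT_BYTES               # buffer starts out empty

    def _refill(self) -> None:
        fresh = bytearray(stream(bytes(self._key), KEY_BYTES + OUT_BYTES))
        self._key[:] = fresh[:KEY_BYTES]    # overwrite the old key right away
        self._buf[:] = fresh[KEY_BYTES:]    # remaining bytes become RNG output
        fresh[:] = bytes(len(fresh))        # wipe the temporary copy
        self._pos = 0

    def randombytes(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n:
            if self._pos == OUT_BYTES:
                self._refill()              # start over with the new key
            take = min(n - len(out), OUT_BYTES - self._pos)
            end = self._pos + take
            out += self._buf[self._pos:end]
            self._buf[self._pos:end] = bytes(take)   # erase consumed bytes
            self._pos = end
        return bytes(out)
```

A caller would construct FastKeyErasureRNG with a 32-byte seed and call randombytes(n). The output depends only on the seed and on the total number of bytes consumed so far, not on how the requests are split up.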

This runs at practically the full speed of AES-CTR. An application that asks for short packets of randomness has forward security immediately after each packet, without having to pay for any extra AES computations.

This RNG construction certainly isn't new but I don't recall ever hearing a good name for it. I'm going to call it a fast-key-erasure RNG, recognizing two aspects of the RNG design: first, the RNG erases k the moment that it uses k; second, keys generated as output from the RNG are immediately erased from the RNG, so the RNG is suitable for applications that erase their keys promptly.

The rest of this blog post discusses alternatives, implementation, and security analysis.

NIST's pointlessly slow RNGs

NIST has stated that its standard "DRBGs" do "an extra step at the end of each request for random bytes" for backtracking resistance (forward security). More broadly, NIST has claimed that any crypto library's "get random bytes" call is "going to do this extra cryptographic work to ensure backtracking resistance".

As far as I can tell, this isn't a political claim regarding the future popularity of NIST's current RNGs; instead it's a technical claim regarding the work that an RNG must do for forward security. But this technical claim is wrong, as illustrated by fast-key-erasure RNGs.

Here's what's really weird about this. NIST describes its RNGs as "random bit generators". NIST emphasizes forward security as an important feature. But NIST's RNGs incur massive costs if the user actually wants forward secrecy after every random bit. Was there never an effort to optimize the RNGs to provide both advertised features simultaneously?

NIST has an "AES-CTR-DRBG" RNG based on AES-CTR, but this RNG doesn't erase the key the instant that the key is used. The RNG starts generating blocks of AES output for the caller. The RNG continues generating enough blocks for all the randomness that the caller wants. Finally, once the caller is satisfied, the RNG generates further AES output to overwrite its key. If the caller merely wanted, say, 1 byte of random data, then there's

an AES call to generate the byte of random data, and then

an AES call to generate a new key, and

another AES call if the user has sensibly opted for a 256-bit key (one 128-bit AES output block isn't enough for a new 256-bit key), and

another AES call because NIST also maintains a random AES input block whose security benefit never seems to have been analyzed.

That's 64 bytes of AES output, if I'm correctly deciphering the NIST standard, for 1 byte of useful random data.
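The following Python sketch turns this count into a small cost comparison. It takes the post's count at face value (output blocks, plus two blocks for a new 256-bit key, plus one block for the AES input block) rather than being an independent reading of the NIST standard. The function names are hypothetical, and the fast-key-erasure figure assumes the 48-block batches described earlier.

```python
import math

BLOCK = 16  # bytes per AES output block


def ctr_drbg_blocks(n: int) -> int:
    """AES calls for one n-byte request, following the count in the post:
    blocks for the output, two for a new 256-bit key, one for the input block."""
    return math.ceil(n / BLOCK) + 3


def fast_key_erasure_blocks(total: int) -> int:
    """AES calls to serve `total` bytes overall: 48 blocks per 736 output bytes."""
    return 48 * math.ceil(total / 736)


requests = 736  # 736 separate one-byte requests
print("DRBG-style AES output bytes:", requests * ctr_drbg_blocks(1) * BLOCK)
print("fast-key-erasure AES output bytes:", fast_key_erasure_blocks(requests) * BLOCK)
```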

It's entirely possible that submissions to NIST's "Post-Quantum Cryptography Project" will be seriously slowed down by NIST's AES-CTR-DRBG. In reaction to this possibility, NIST seems to be recommending that submissions use an extended AES-CTR-DRBG interface that handles a request for b_1 bytes, a subsequent request for b_2 bytes, a subsequent request for b_3 bytes, etc. by implicitly generating a single AES-CTR-DRBG output with more than b_1+b_2+b_3+... bytes and returning segments of this output. This is compatible with answering each request immediately: the extended interface returns the first b_1 bytes of output without knowing b_2 etc., and without knowing the total number of bytes that will be required for this AES-CTR-DRBG output.

From a security perspective, this is just like using AES-CTR. Unless the caller goes to extra effort to end the AES-CTR-DRBG output, the AES-CTR key is retained after each call to the extended interface, so the "backtracking resistance" that NIST advertises for AES-CTR-DRBG is destroyed.

The extended RNG interface is dangerous because the easiest way to use it fails to erase keys. NIST's DRBGs are dangerous because their poor speed encourages this sort of interface.

A better approach would be to use a proper call to AES-CTR-DRBG to fill up, say, a 736-byte RNG output buffer; when I say "proper call" I mean that AES-CTR-DRBG would then overwrite its old key. Random bytes would then be provided and immediately erased from the buffer, and AES-CTR-DRBG would be called again after 736 bytes. This is almost as small and almost as fast as a fast-key-erasure RNG, but it has a significantly more complicated specification and no advantages.

Fast-key-erasure RNGs in SUPERCOP

My previous blog post reported news regarding the SUPERCOP benchmarking package. Part of the news was that SUPERCOP uses a fast-key-erasure RNG, but I didn't explain what this was or how it's implemented. Internally, the RNG is modularized as follows:

There was already a crypto_stream_aes256ctr(out,outlen,n,k) function producing a stream of AES-256 output AES_k(0),AES_k(1),... from a key k, stopping at outlen bytes. This function has an extra input, a "nonce" n, that starts the AES block counter at something other than 0, but the RNG always takes the nonce to be 0. SUPERCOP currently has two C implementations of this function: a self-contained implementation from Romain Dolbeau using AES-NI, and an implementation that simply calls OpenSSL (assuming OpenSSL is available on the system).

There's a new crypto_rng_aes256(r,n,k) that takes as input a 32-byte key k and produces two outputs: a new 32-byte key n (blocks B_0 = AES_k(0) and B_1 = AES_k(1)), and a 736-byte RNG output r (blocks B_2 = AES_k(2), ..., B_47 = AES_k(47)). There are also similar new functions crypto_rng_salsa20 and crypto_rng_chacha20, using my Salsa20 and ChaCha20 stream ciphers.

fastrandombytes is a fast-key-erasure RNG. It maintains a buffer of crypto_rng output, providing and erasing bytes from the buffer whenever someone calls randombytes. It also maintains a key; it uses this key to generate a new key and refill the buffer whenever the buffer is empty. fastrandombytes automatically selects whichever is fastest out of crypto_rng_aes256, crypto_rng_salsa20, and crypto_rng_chacha20.

The initial key for fastrandombytes comes from another layer, kernelrandombytes, which calls /dev/urandom to obtain random bytes from the operating-system kernel. There's also some preliminary work to upgrade to getentropy and getrandom.

knownrandombytes is a reproducible-output version of fastrandombytes, as mentioned in my previous blog post. Specifically, it starts from key 0, and it always uses crypto_rng_chacha20. This is what SUPERCOP now uses for automatic known-answer tests.
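The layering can be sketched in Python as follows. This is an illustration of the interfaces only, not SUPERCOP's C code: crypto_rng here is a stand-in built on SHAKE-256 rather than AES-256, Salsa20, or ChaCha20, make_randombytes is a hypothetical helper, os.urandom plays the role of the kernel layer, and no erasure is attempted because Python bytes objects are immutable.

```python
import hashlib
import os
from typing import Callable, Tuple


def crypto_rng(k: bytes) -> Tuple[bytes, bytes]:
    """Return (r, n): 736 output bytes r and a new 32-byte key n.

    Stand-in built on SHAKE-256 (an assumption of this sketch).
    """
    s = hashlib.shake_256(k).digest(768)
    return s[32:], s[:32]


def kernelrandombytes(n: int) -> bytes:
    """Obtain bytes from the operating-system kernel."""
    return os.urandom(n)


def make_randombytes(key: bytes) -> Callable[[int], bytes]:
    """Build a randombytes function that buffers crypto_rng output."""
    state = {"key": key, "buf": b""}

    def randombytes(n: int) -> bytes:
        out = b""
        while len(out) < n:
            if not state["buf"]:
                # Buffer empty: derive a new key and refill from the current key.
                state["buf"], state["key"] = crypto_rng(state["key"])
            take = n - len(out)
            out += state["buf"][:take]
            state["buf"] = state["buf"][take:]
        return out

    return randombytes


fastrandombytes = make_randombytes(kernelrandombytes(32))  # kernel-seeded
knownrandombytes = make_randombytes(bytes(32))             # reproducible, key 0
```

The only difference between the two top-level functions is where the initial key comes from, which is what makes the reproducible version usable for known-answer tests.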

I generally recommend against AES for production use, for two reasons. First, AES provides a significantly worse security/speed tradeoff than state-of-the-art ciphers. This doesn't matter for most applications, but sometimes it does matter. RC4 would have been eliminated many years earlier if AES had been faster than RC4 on common CPUs rather than slower. Anecdotal evidence suggests that almost everyone who deploys AES-128 today instead of AES-256 is doing it either because "AES-256 is 40% slower" or because they're copying the choice from someone else; but AES-128, like other 128-bit block ciphers, is at risk from batch attacks and quantum attacks.

Second, natural AES software implementations are vulnerable to cache-timing attacks. Maybe the RNG keys are hard to break because each key is used only a few times, but maybe not; why take the risk? If the CPU has Intel's AES-NI or similar AES hardware, then this problem disappears (and crypto_rng_aes256 becomes slightly faster than crypto_rng_salsa20 and crypto_rng_chacha20), but it's still dangerous to specify AES in systems that will be implemented on multiple platforms; it's safer to specify a cipher that doesn't have the problem in the first place.

Age is worth something, and AES (1998) is older than Salsa20 (2005). On the other hand, Salsa20 has a bigger block size, a larger security margin against all known attacks, extensive review during and after the eSTREAM project, and huge implementation advantages. I'm not so sure that it's right to choose ChaCha20 (2008) over Salsa20: the slight advantages that motivated the ChaCha20 design do seem to be holding up after further analysis, but they aren't as compelling as the advantages of Salsa20 (and ChaCha20) over AES.

Fast-key-erasure RNGs in production

Many cryptographic libraries have RNGs that aim for the same security goals but that are more complicated and harder to audit (and slower). I recommend that these libraries switch to a fast-key-erasure RNG.

SUPERCOP doesn't make any guarantees of having been audited, and in particular fastrandombytes will need to be audited before deployment. Some libraries go to extra effort for thread-safety, fork-safety, etc., which means adding and auditing extra code. What I'm recommending here isn't particular software but particular mathematical functions: random numbers should be generated by a fast-key-erasure RNG as described above.

These library RNGs are typically seeded from the RNG in the operating-system kernel, the same way that SUPERCOP's fastrandombytes uses kernelrandombytes as described above. A simpler approach, taken in NaCl, is for the cryptographic library to simply use the kernel RNG without maintaining another RNG layer. Obviously this approach uses less code and is easier to audit; I see no security justification for OpenBSD saying "getentropy() is not intended for regular code" and limiting the getentropy() output to 256 bytes. The syscall might be a speed problem in some post-quantum systems, but if this turns into a real-world problem then I see several ways to deal with it without the current mess of non-kernel RNG code.

Anyway, it's certainly important for the kernel RNG to be secure. Many kernels have RNGs that are again more complicated and harder to audit (and slower) than a fast-key-erasure RNG. I recommend that these kernels switch to a fast-key-erasure RNG.

Kernels in virtual machines face a clone-safety issue analogous to the userspace fork-safety issue. There are complicated solutions where cloning triggers an RNG reinitialization, but the simplest solution is for the kernel to simply call a hypervisor RNG. A hypervisor interface isn't complete without an RNG.

The kernel (or hypervisor) is responsible for properly seeding its RNG. The central problem here is to generate an initial seed during OS installation. Certainly this problem needs to be solved so that the OS can securely generate (e.g.) ssh keys at installation time. Once this central problem is solved, it's relatively straightforward to seed each subsequent kernel boot: on each boot, the kernel immediately uses the RNG to generate a new random seed for the next boot, and immediately makes sure that the new seed has been safely written to long-term storage, overwriting the old seed. The seed in long-term storage also needs to be changed whenever the filesystem is backed up.

Beware that reliably overwriting data on disk isn't as easy as it sounds, and reliably overwriting data on flash storage is really difficult. Below I'll come back to one consequence of this.
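The boot-time seed update can be sketched in Python as follows. This is a user-space illustration under stated assumptions, not kernel code: the function name and the seed-file layout (a bare 32-byte file) are hypothetical, SHAKE-256 stands in for the RNG, and an in-place write followed by fsync is only a request to the storage stack, which, as just noted, may not actually destroy the old seed on disk or flash.

```python
import hashlib
import os


def rotate_boot_seed(path: str) -> bytes:
    """Replace the stored seed with a fresh one and return this boot's RNG key."""
    with open(path, "r+b") as f:
        old_seed = f.read(32)
        if len(old_seed) != 32:
            raise ValueError("expected a 32-byte seed file")
        # Stand-in expansion (assumption): one call yields the next seed
        # and an independent-looking key for the current boot.
        derived = hashlib.shake_256(old_seed).digest(64)
        next_seed, boot_key = derived[:32], derived[32:]
        f.seek(0)
        f.write(next_seed)       # overwrite the old seed in place
        f.flush()
        os.fsync(f.fileno())     # ask the OS to push the write to storage
    return boot_key
```

The key for the current boot is returned only after the new seed has been written, matching the order described above: update the stored seed first, then start handing out randomness.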

How is a seed generated during OS installation? As mentioned above, this seed is obtained as a hash of data from various entropy sources. Precise CPU cycle counts for DRAM access seem to have considerable entropy on some devices but seem completely predictable on others; auditing this requires studying how different pieces of hardware are clocked. Precise cycle counts for keyboard input, spinning disks, etc. seem very hard for the attacker to predict, but many small computers don't have keyboards and disks. Installing these computers typically means flashing them from a master computer, and this master computer should use its RNG to generate a seed for the small computer. Similarly, whenever a computer creates an installation USB stick, it should use its RNG to generate a seed for the stick; and, whenever the stick is booted, it should update its seed, just like any other kernel boot. Installation from a read-only device should demand keyboard input.

Typically a kernel continues using cycle counts of various events to inject new entropy into its RNG after boot. This is often advertised as providing "backward security" (or "prediction resistance" in NIST terminology), but I'm skeptical that "backward security" has any real-world value. If an attacker has broken security so thoroughly as to be able to see the current RNG state stored securely inside the kernel, then why is it noticeably harder for the attacker to also see any future RNG state of interest? Don't we normally envision the attacker's resources as constantly expanding? People hope that each software security update reduces the attacker's resources, so it makes sense to inject new entropy at that point, but this isn't the same as constantly injecting new entropy.

Aiming for "backward security" has created all sorts of complications whose security isn't clear. One can't simply hash a new cycle count into the RNG state: an attacker who knows the previous RNG state and watches the RNG output can efficiently guess every possibility for the cycle count, and then knows the new RNG state. One has to instead accumulate many cycle counts into a separate hash before injecting that hash into the RNG. At this point the auditor asks what "many" means. The answer includes a mess of questionable "entropy estimation" mechanisms. An alternative, used in FreeBSD, is the relatively pleasant Fortuna from Ferguson and Schneier. (As a side note, I highly recommend reading the Ferguson–Schneier–Kohno "Cryptography engineering" book, which has more detailed coverage of some of the issues I'm covering in this blog post, and also covers many other important issues in cryptography.)
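The guessing attack on a single hashed-in cycle count can be sketched in Python as follows. Everything here is a toy model chosen for illustration: the mix and output functions, the use of SHA-256, and the assumption that one cycle count carries at most 16 unpredictable bits are all assumptions of this sketch, not descriptions of any real kernel.

```python
import hashlib
from typing import Optional

COUNT_BITS = 16  # assumed unpredictability of one cycle count (illustrative)


def mix(state: bytes, cycle_count: int) -> bytes:
    """Toy reseed: hash one cycle count directly into the RNG state."""
    return hashlib.sha256(state + cycle_count.to_bytes(4, "little")).digest()


def output(state: bytes) -> bytes:
    """Toy RNG output derived from the state."""
    return hashlib.sha256(b"out" + state).digest()


def recover_state(prev_state: bytes, observed: bytes) -> Optional[bytes]:
    """Attacker who knows the previous state tries every possible cycle count."""
    for guess in range(1 << COUNT_BITS):
        candidate = mix(prev_state, guess)
        if output(candidate) == observed:
            return candidate
    return None


prev = bytes(32)               # state already known to the attacker
new = mix(prev, 54321)         # defender injects one cycle count
print(recover_state(prev, output(new)) == new)
```

With only 2^16 candidates the search is instant. Accumulating many counts into a separate hash before injecting is meant to make the number of candidates too large to enumerate, which is where the question of what "many" means comes from.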

For comparison, the argument for "forward security" is much more convincing: erasing keys has real value against an attacker who was already recording your network traffic and then decides to steal your computer. I think the strongest argument for continuing to inject new entropy is that, as mentioned above, it's hard to reliably overwrite old seeds stored on disk or (especially) flash. A failure to erase the RNG seed can compromise every key-erasure mechanism elsewhere in the system.

Anyway, whether or not you want "backward security", you can and should use a fast-key-erasure RNG. All of the seeding/reseeding options I've mentioned are compatible with using a deterministic, well-tested fast-key-erasure RNG module. The module can simply start with key 0, and provide an inject(e,elen) function that replaces the fast-key-erasure key k with SHA-256(k,SHA-256(e,elen)) and clears the RNG output buffer. It's up to the caller to accumulate enough entropy for these inject calls, and in the "backward security" scenario to accumulate enough entropy in each inject call.
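A minimal Python sketch of such a module's inject function follows. The class name is hypothetical, and reading SHA-256(k,SHA-256(e,elen)) as the hash of k concatenated with the hash of e is an interpretation made for this sketch; the output side of the RNG is omitted.

```python
import hashlib


class RNGModule:
    """Deterministic module state: a 32-byte key plus an output buffer."""

    def __init__(self) -> None:
        self.key = bytes(32)        # the module simply starts with key 0
        self.buf = bytearray()      # pending RNG output (empty at start)

    def inject(self, e: bytes) -> None:
        """Replace the key k with SHA-256(k || SHA-256(e)) and clear the buffer."""
        inner = hashlib.sha256(e).digest()
        self.key = hashlib.sha256(self.key + inner).digest()
        self.buf = bytearray()      # old buffered output must not be reused


m = RNGModule()
m.inject(b"entropy gathered by the caller")   # placeholder input
print(m.key.hex())
```

Because the module itself is deterministic, it can be tested with known inputs, while the decision about how much entropy is enough stays with the caller.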

Security level of fast-key-erasure RNGs

Assume that a fast-key-erasure RNG is securely seeded. What's the fastest attack we can come up with against the resulting random numbers? Where are the risks of better attacks?

Concretely, suppose a user reveals 736 gigabytes of RNG output, involving a chain of 2^30 AES keys. Suppose 2^30 users all do the same thing starting from independent uniform random seeds, and the attacker sees all 736 exabytes of data. Can the attacker tell that these are in fact RNG output instead of independent uniform random bytes?

Attack 1. The attacker records a tiny fraction of this data, namely the first bit from each 736-byte output block. This is only a fraction of an exabyte, a few million dollars in hard drives.

The attacker guesses a seed, and follows this seed for a chain of 2^80 AES keys, recording the first bit from each 736-byte output block. There's an obstacle here, namely that doing so many computations serially will be infeasible for the foreseeable future, but let's ignore this problem for a moment and simply focus on how much information is visible from 2^80 AES computations.

If a user's seed u matches any of the attacker's 2^80 AES keys, then the user's subsequent outputs will match the subsequent outputs computed by the attacker. Checking, say, the next 256 bits (assuming the key isn't within 256 bits of the end) will see a match that won't happen by chance.

If u doesn't match any of the attacker's keys, it's still entirely possible that the user's next key v = (AES_u(0),AES_u(1)) will match one of the attacker's keys, and then the next 256 bits from the user will match the next 256 bits computed by the attacker. And so on for subsequent user keys.

The attacker can recognize all of these matches by looking up each substring of 256 consecutive bits in a database of substrings obtained from the users. There's another obstacle here, namely that this imposes massive communication costs, but let's ignore this problem too.

The bottom line is that each of the 2^30 users has 2^30 ways to bump into each of the 2^80 attacker keys, for a total success probability approximately 2^140/2^256.
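The arithmetic behind Attack 1 can be checked with a few lines of Python. This only restates the figures from the text in base-2 logarithms; the variable names are chosen for this sketch, and the storage figure counts one recorded bit per 736-byte output block.

```python
users_log2 = 30            # 2^30 users
keys_per_user_log2 = 30    # each running a chain of 2^30 keys
attacker_keys_log2 = 80    # attacker computes 2^80 keys
key_bits = 256

# Each (user key, attacker key) pair matches with probability 2^-256.
success_log2 = users_log2 + keys_per_user_log2 + attacker_keys_log2 - key_bits
print(f"success probability about 2^{success_log2}")

# Storage for the recorded data: one bit per 736-byte output block.
recorded_bits_log2 = users_log2 + keys_per_user_log2
recorded_bytes = 2**recorded_bits_log2 // 8
print(f"recorded data: about {recorded_bytes / 10**18:.3f} exabytes")
```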

This analysis isn't exact. For example, the attacker could break two keys at once, and doesn't get double credit for this; also, almost all keys (all keys after the initial seeds) are obtained as pairs (B_0,B_1) where B_0 and B_1 are distinct, so denominator 2^256−2^128 would make more sense than 2^256. There could be bigger issues that I've missed. I haven't experimentally verified scaled-down versions of this attack; I would say that there's some risk that the attack is actually much less effective (failing for some reason I didn't think of), but very little risk that the attack is more effective.

Attack 2. This is better than Attack 1 because it drastically reduces the communication costs.

Instead of recording the first bit from each output block, the attacker records the first distinguished output block, meaning the first output block that begins with the three bytes (0,0,0). This is just 2^36 blocks. Even better, the attacker stores just the last 256 bits of each distinguished block, just a few terabytes of data.

For each of the 2^80 computed keys, if the corresponding block is distinguished, the attacker looks for the block in the user database. This is just 2^56 lookups.

If a user seed u matches one of the computed keys, then the first distinguished user block will match the next distinguished computed block. Similarly, if a subsequent user key matches one of the computed keys, then the next distinguished user block will match the next distinguished computed block.

This attack is a few percent less effective than Attack 1, because some user keys aren't followed by any distinguished blocks, but overall the success probability should again be close to 2^140/2^256.
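The counts in Attack 2 follow from the fact that a uniform random block begins with three zero bytes with probability 2^-24. The Python sketch below redoes that arithmetic; the variable names are chosen for this sketch, and the numbers are expected values, not exact counts.

```python
total_user_blocks_log2 = 30 + 30   # 2^30 users times 2^30 output blocks each
distinguished_log2 = 24            # three zero bytes: probability 2^-24
attacker_keys_log2 = 80

# Expected number of distinguished blocks among all user output.
stored_blocks_log2 = total_user_blocks_log2 - distinguished_log2

# Storing only the last 256 bits (32 bytes) of each distinguished block.
stored_bytes = 2**stored_blocks_log2 * 32

# Expected number of attacker blocks that are distinguished, hence lookups.
lookups_log2 = attacker_keys_log2 - distinguished_log2

print(f"distinguished user blocks: 2^{stored_blocks_log2}")
print(f"database size: about {stored_bytes / 10**12:.1f} terabytes")
print(f"lookups: 2^{lookups_log2}")
```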

Attack 3. This is better than Attack 2 because it allows essentially any amount of parallelism. Instead of running one seed through a chain of 2^80 keys, the attacker tries 2^56 seeds, running each seed until the first distinguished block.

This is an example of the parallel rho method from 1997 van Oorschot–Wiener, although the application here is slightly different from the usual applications. The parallelism and low communication costs make Attack 3 feasible for serious attackers, and experimental verification by academics can easily go beyond 2^50 keys.

Again the success probability is approximately 2^140/2^256. This is an extremely small chance of success, although it is 2^60 times bigger than one might expect from an attacker trying 2^80 AES-256 keys.

Unless I'm missing something big, the same attack would be practically guaranteed to succeed for AES-128. Remember, kids: Don't use 128-bit cipher keys!
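The contrast between 256-bit and 128-bit keys can be made concrete with the same back-of-the-envelope estimate. In this Python sketch the function name is hypothetical, the estimate is the rough product-of-counts approximation from the text capped at probability 1, and the 128-bit case assumes the same numbers of users, keys per user, and attacker keys.

```python
def success_log2(key_bits: int,
                 users_log2: int = 30,
                 keys_per_user_log2: int = 30,
                 attacker_keys_log2: int = 80) -> int:
    """log2 of the estimated success probability, capped at 0 (probability 1)."""
    raw = users_log2 + keys_per_user_log2 + attacker_keys_log2 - key_bits
    return min(0, raw)


# Batch attack against the RNG chains versus a naive search for one key.
batch_256 = success_log2(256)
single_256 = success_log2(256, users_log2=0, keys_per_user_log2=0)
print(f"256-bit keys: 2^{batch_256}, a factor 2^{batch_256 - single_256} "
      f"above a single-target search")
print(f"128-bit keys: 2^{success_log2(128)}")
```

For 128-bit keys the uncapped estimate exceeds 1, which is the sense in which the attack would be practically guaranteed to succeed.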

More attacks. Are there better attacks than what I've described? There are at least four different aspects to this question:

Does AES itself have a serious weakness, allowing the AES output blocks to be distinguished from truly random distinct output blocks? As I mentioned above, this would be a huge breakthrough in cryptanalysis.

How much damage is done by the distinctness of AES output blocks?

Even if AES outputs are indistinguishable from uniform, is there some better way to exploit the structure of the fast-key-erasure RNG?

Do quantum computers allow faster attacks? NIST has claimed specific security levels for AES against Grover's algorithm in a realistic parallel setting, but these security levels will need to be reassessed: Banegas and I have a paper appearing at SAC 2017 on "Low-communication parallel quantum multi-target preimage search".

The PRP-PRF switch

Let's focus for a moment on the second question stated above, namely how much the attacker can benefit from knowing that the 128-bit AES output blocks are distinct.

There's a standard theorem called the "PRP-PRF switch" saying that the attacker's benefit is at most b(b−1)/2^129, where b is the number of AES blocks generated. Formally, you're supposed to analyze the success probability of an attack against an ideal secret-key function F that produces independent uniform random outputs for all inputs; then the PRP-PRF switch says that the probability increases by at most b(b−1)/2^129 if you instead plug in an ideal secret-key permutation P, which is just like F except for magically guaranteeing that the outputs are distinct; and, finally, we don't know how to distinguish AES_k from P without seeing k.

But wait. The users together are generating 48*2^60 AES blocks. If b = 48*2^60 then b(b−1)/2^129 is larger than 1, so the PRP-PRF switch doesn't say anything.
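A quick Python check of this claim, evaluating the bound for the total number of blocks generated by all users. The function name is chosen for this sketch, and the second line, for the 48 blocks produced under a single key, is included only for scale; it is not a claim about what has been proven in the multiple-key setting.

```python
from fractions import Fraction


def switch_bound(b: int) -> Fraction:
    """PRP-PRF switch bound b(b-1)/2^129 for b blocks of 128 bits."""
    return Fraction(b * (b - 1), 2**129)


b_total = 48 * 2**60   # all AES blocks generated by all users together
print("all users:", float(switch_bound(b_total)))
print("one key  :", float(switch_bound(48)))
```

A bound larger than 1 on a probability difference carries no information, which is why the standard theorem is silent at this scale.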

Is it possible to address this with the slightly-beyond-birthday-bound version of the PRP-PRF switch that I proved in 2005? Or can we use the fact that there are actually many different AES keys, with no guarantee of distinctness across keys? What exactly has been proven about the PRP-PRF switch in this multiple-key scenario? The fast-key-erasure RNG has a much tighter limit on the lifetime of each key than NIST's AES-CTR-DRBG does; does this produce quantitatively better security bounds?

These are good topics for provable-security papers, and for real-world attacks when people screw up the details badly enough. But let me point out a better approach.

Salsa20 has a 512-bit block. ChaCha20 has a 512-bit block. The explicit security goal of Salsa20 and ChaCha20 is for the outputs to be indistinguishable from uniform. There's no funky distinctness qualifier to worry about.

Internally, there's a permutation that always produces distinct 512-bit outputs from distinct 512-bit inputs. Checking for this distinctness is simply one of many failed attack strategies, with success probability so low as to barely be worth mentioning, whereas for AES-256-CTR the distinctness of 128-bit output blocks is the most powerful attack we know.

Of course, if we use ciphers with small block sizes, then it's easy to motivate papers proving something about the damage caused by this weakness. Maybe 128-bit blocks are too big for most cryptographers to really appreciate the danger, but NSA is currently trying to push standardization and deployment of Simon-64-128 and Speck-64-128, two ciphers with tiny 64-bit blocks and 128-bit keys. I was one of about 40 people sitting in a meeting where the speaker, NSA's Louis Wingers (one of the Simon and Speck authors), falsely claimed that counter mode is safe for 64-bit blocks, since counter mode doesn't have block collisions. NSA's continuing promotion of these dangerous ciphers includes perfect sentences to quote in the introductions of "provable security" papers studying small block sizes.

Part of the "provable security" culture is to then praise the resulting systems for having proofs. But avoiding the weakness in the first place is simpler and more robust. The system with more proofs—the system using a cipher with small blocks—is more fragile and harder to audit. The additional proofs are advertised as a sign of safety but are actually a sign of danger.

Broader problems with "provable security"

Proofs sometimes play a useful role in cryptography, and the rest of my blog post will look at some proofs in detail. But there are some important caveats here regarding "provable security":

Proofs are almost never carefully checked. There's a long history of outright proof errors, and sometimes the claimed theorems are unsalvageable. As I put it in a 2015 talk: "Proofs are increasingly complex, rarely reviewed, rarely automated."

There's a long history of proofs that are quantitatively so weak ("loose") that they say nothing about the deployed systems, but that are nevertheless claimed to provide assurance about those systems.

There's a long history of proofs that work in oversimplified attack models and that are blind to attacks outside those models.

There's a long history of people creating dangerous cryptographic structures so as to allow proofs, as illustrated by the Blum–Blum–Shub RNG and the Chaum–van Heijst–Pfitzmann compression function.

There's a long history of gullible standardization agencies and implementors hearing about proofs and then failing to demand thorough review by cryptanalysts.

The canonical starting point to learn more about these problems is the "Another look at provable security" series of papers by Koblitz and Menezes. Menezes's invited talk at Eurocrypt 2012 is a great introduction.

I expect the first three problems to eventually be fixed through computer verification, increased attention to "tightness", and increased attention to the accuracy of security models. But these are huge problems today.

I don't expect the fourth problem to go away (and I'm not sure about the fifth). There's far too much pressure for people to write papers aiming at the fundamental goal of "provable security", namely to prove that complete systems are as secure as primitives. It's straightforward to reach this goal by choosing sufficiently weak primitives, whereas it's difficult, perhaps impossible, to reach this goal in any other way.

The surprisingly complicated literature on proofs of rekeying

With these caveats in mind, I'll now focus on the third question stated above: assuming the cipher outputs are indistinguishable from uniform, is there some better way to exploit the structure of the fast-key-erasure RNG?

Using some cipher output to generate a new key for the cipher (and not using that cipher output in any other way) is an ancient and very frequently used idea. The security intuition is straightforward: if the attacker can't distinguish the cipher output from uniform, then the attacker can't tell the difference between the actual situation and a situation where the new cipher key is generated independently at random.

One would think that there would have been a paper many years ago formalizing this intuition as an easy-to-use theorem and giving a simple, convincing proof. The security of the fast-key-erasure RNG would visibly be a special case of the theorem, and this would be the end of the story.

In fact, the literature on this topic is surprisingly large and surprisingly messy. The same rekeying idea appears under at least three names with separate proofs, as illustrated by the following papers:

The "cascade" construction from 1996 Bellare–Canetti–Krawczyk uses a short-nonce stream cipher S to build a longer-nonce stream cipher T. For example, given key k_0 and nonce (N_1,N_2), first use S with key k_0 and nonce N_1 to produce a new key k_1 = S(k_0,N_1); I'm assuming here that the S output for each nonce has the same length as the key. Then use S with key k_1 and nonce N_2 to produce an output T(k_0,(N_1,N_2)) = S(S(k_0,N_1),N_2). The paper credits a 1986 Goldwasser–Goldreich–Micali paper with the case of a 1-bit nonce for S, meaning that S converts a key k into a double-length output S(k,0),S(k,1). The 1996 paper allows variable-length nonces for T as long as no nonce is a prefix of another nonce (i.e., as long as T doesn't simply output its internal keys): for example, T can output S(k_0,0), S(S(k_0,1),0), and S(S(S(k_0,1),1),0).

The "NMAC" and "HMAC" constructions, from a separate 1996 Bellare–Canetti–Krawczyk paper, use a short-input-block compression function S to build a longer-input-message keyed hash function T. For example, given initialization vector k_0 and input message (N_1,N_2), first use S with initialization vector k_0 and block N_1 to produce a new initialization vector k_1 = S(k_0,N_1), and then use S with initialization vector k_1 and block N_2 to produce an output T(k_0,(N_1,N_2)) = S(S(k_0,N_1),N_2). This construction is identical to the "cascade" (plus an extra element that can be analyzed separately: the output of T, with k_0 chosen as a MAC key, is encrypted to obtain an authenticator), but there's a separate proof in this paper, followed by another separate proof in a 2006 Bellare paper.

The "architecture for robust pseudo-random generation" from 2005 Barak–Halevi uses a "PRG" S that converts a key k into a double-length output S(k,0),S(k,1). It then builds a random-number generator T that outputs S(k,0), S(S(k,1),0), etc. This construction is a special case of the "cascade" but there's again a separate proof. There seems to be considerable extra work here to handle injection of extra entropy, which is a distraction for readers who don't care about "backward security".
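To make the shared structure of these three constructions concrete, here's a minimal Python sketch. It's only an illustration: HMAC-SHA-256 is used as a stand-in for S purely because it's in the standard library and has key-length output, and the names `S`, `cascade`, and `chain_outputs` are hypothetical rather than taken from any of the papers.

```python
import hashlib
import hmac

KEY_BYTES = 32  # the stand-in S has key-length (32-byte) output

def S(key: bytes, nonce: bytes) -> bytes:
    """Stand-in for the short-nonce cipher S (an assumption, not from the papers)."""
    return hmac.new(key, nonce, hashlib.sha256).digest()

def cascade(k0: bytes, nonces: list[bytes]) -> bytes:
    """T(k0, (N1, ..., Nn)) = S(...S(S(k0, N1), N2)..., Nn)."""
    key = k0
    for nonce in nonces:
        key = S(key, nonce)  # each output becomes the key for the next level
    return key

def chain_outputs(k0: bytes, count: int) -> list[bytes]:
    """S(k0,0), S(S(k0,1),0), S(S(S(k0,1),1),0), ...

    This is the 1-bit-nonce case: nonce 0 produces output, nonce 1 produces
    the next key, so the nonce sequences 0, 10, 110, ... are prefix-free.
    """
    outputs = []
    key = k0
    for _ in range(count):
        outputs.append(S(key, b"\x00"))
        key = S(key, b"\x01")
    return outputs
```

In this sketch the third RNG-style output is the same value as the cascade evaluated at the nonce sequence (1,1,0), which is the sense in which the RNG is a special case of the cascade.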

The proofs are surprisingly long, given how simple the intuition is. Typically the proofs are buried in appendices; often they're only sketched. There are more papers with more proofs: e.g., my XSalsa20 paper includes a new theorem with a quantitative improvement. People trying to check proofs will obviously be overwhelmed, and it's not surprising that some errors have slipped through:

The theorem stated in 1996 Bellare–Canetti–Krawczyk left out an important "q" factor. Anyone checking the proof would have noticed this omission, but it seems that the first public correction was nine years later.

The 2006 Bellare paper hypothesizes, in quantitatively applying its NMAC theorem, that "the best attack against" a well-studied compression function "as a PRF is exhaustive key search". But this hypothesis is simply wrong for the (quite standard) definition of PRF used in that paper. This error wasn't pointed out until a paper "Another look at HMAC" by Koblitz and Menezes six years later. The Koblitz–Menezes observation was at first met with denials, but my impression is that the denials stopped after my followup paper with Lange, "Non-uniform cracks in the concrete".

In trying to rescue Bellare's quantitative results, Koblitz and Menezes made a different mistake related to the PRF definition. This error took only a year to catch (Pietrzak pointed out the error in 2013; fixes appear in the 2013 Pietrzak paper, in a 2013 update of the Koblitz–Menezes paper, and in 2014 Gaži–Pietrzak–Rybár) but obviously the reader isn't left with a feeling of confidence.

This isn't a complete survey of the literature, but adding more information will simply make auditors more worried. For example, a 2006 paper by Campagna sounds at first like it's proving security bounds for AES-CTR-DRBG, but a closer look shows that the paper is only studying AES-CTR and isn't actually proving anything about rekeying.

Do we really believe that all the errors have been eliminated at this point from theorems proving the security of rekeying? One correct application of one correct theorem should be enough, but why is the auditor supposed to believe any particular theorem?

Technical issues creating the mess

Here are four specific issues that bother me about proofs in this area.

Monolithic handling of multiple levels of rekeying. An initial key is used to produce outputs, some of which are used as derived keys for a followup protocol. It seems intuitively clear that any attack has to find non-randomness in the outputs, or find a weakness in the followup protocol.

For example, the initial key k for the fast-key-erasure RNG is used to produce outputs (AES_k(0),...,AES_k(47)), and then (AES_k(0),AES_k(1)) are used as a derived key for a followup protocol, namely the same RNG applied recursively. It seems intuitively clear that any attack has to find non-randomness in (AES_k(0),...,AES_k(47)), or find a weakness in the followup use of (AES_k(0),AES_k(1)). But the proofs in the literature usually don't work this way: they consider the entire chain or tree of derived keys at once.

The main reason I wrote a new proof in my XSalsa20 paper was to follow the intuition more closely, first proving a theorem about one level of derived keys and then deducing a multi-level theorem by induction. This proof is also considerably shorter than the Bellare–Canetti–Krawczyk "cascade" proof, and I think this reflects a real simplification. But newer papers don't seem to have adopted this strategy.

Oversimplified cost metrics. The "fast" algorithms constructed in many "security proofs"—including the proofs in this area, including the proof in my XSalsa20 paper—are serial algorithms that build giant arrays of random numbers, queries, etc. This can end up dominating the cost of the attack. Maybe these costs can be reduced, as in the improvements from Attack 1 to Attack 2 and Attack 3, but maybe not.

There was a short "Notes on low-memory attacks" subsection in my XSalsa20 paper pointing out this issue. Regarding two arrays U and V of random numbers, I wrote "A standard way to eliminate the space for U and V is to replace random-number generation by pseudorandom-number generation." Apparently this is called "the random oracle technique" in a Crypto 2017 paper by Auerbach–Cash–Fersch–Kiltz, which highlights (and claims to introduce) the topic of "memory tightness" in reductions. I also had some ad-hoc suggestions for eliminating the space for an "array of query prefixes" used in my proof (and in previous proofs).

Non-constructive definitions. The traditional type of security definition considers the chance that a cost-limited algorithm A breaks a cryptographic system X. The problem with this type of definition is that it allows unrealistic attacks A that take a huge amount of time to find: i.e., attacks that allow a huge amount of precomputation. This is what led to the mistake mentioned above in the 2006 Bellare paper.

To exclude such attacks, Lange and I proposed instead considering the chance that a small cost-limited algorithm P prints a cost-limited algorithm A that breaks X. (See Appendix B.4 of "Non-uniform cracks in the concrete".) Any reduction theorem then has to be stated as a theorem about P, not merely a theorem about A. This is compatible with most of the proofs I've mentioned but excludes the proof in the 2006 Bellare paper.

Working with the wrong cipher security metric. Serious attack analysis always has to consider attacks against multiple targets: often there are multiple-key attacks more effective than attacking one key at a time.

From this perspective, it's weird to see theorems that make hypotheses about the security of one cipher key rather than hypotheses about the security of many independent cipher keys. It's similarly weird to see theorems drawing conclusions about the security of one RNG/cascade/NMAC/HMAC/... key instead of the security of many independent keys.

The Bellare–Canetti–Krawczyk "cascade" security proof actually does make a hypothesis about the security of many independent cipher keys, but it then draws a conclusion about the security of just one cascade key. In the talk accompanying my paper, I briefly mentioned that a multi-key hypothesis allowed a tight multi-key conclusion. But I didn't write down proof details at the time, and I didn't realize that focusing on multi-key security allowed a considerably simpler proof.

Decomposing multi-key RNG attacks into multi-key cipher attacks

Let's define G(k) as the string (AES_k(0),AES_k(1)), and F(k) as the string (AES_k(2),AES_k(3),...,AES_k(47)). Let's focus on the first 1472 bytes of RNG output, F(k) and F(G(k)). Actually, let's generalize a bit to F(k) and H(G(k)), allowing (but not requiring) a different function H to be used after the first 736 bytes.
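As a sanity check on the byte counts, here's a small Python sketch of G, F, and the first levels of RNG output. One loud caveat: the standard library has no AES, so `block` below is a truncated-SHA-256 placeholder standing in for the 16-byte block that AES would produce for key k and counter i. It is not AES (it isn't even a permutation), and the function names are hypothetical.

```python
import hashlib

BLOCK_BYTES = 16     # AES block size
BLOCKS_PER_KEY = 48  # blocks 0..47 are generated from each key

def block(key: bytes, counter: int) -> bytes:
    """Placeholder for the AES block at this counter; NOT AES, illustration only."""
    data = key + counter.to_bytes(16, "big")
    return hashlib.sha256(data).digest()[:BLOCK_BYTES]

def G(key: bytes) -> bytes:
    """Blocks 0 and 1: the 32-byte derived key."""
    return block(key, 0) + block(key, 1)

def F(key: bytes) -> bytes:
    """Blocks 2..47: the 736 bytes handed to the application."""
    return b"".join(block(key, i) for i in range(2, BLOCKS_PER_KEY))

def rng_output(key: bytes, levels: int) -> bytes:
    """F(k), F(G(k)), F(G(G(k))), ... for the given number of levels."""
    out = b""
    for _ in range(levels):
        out += F(key)
        key = G(key)  # the old key is overwritten before the next level
    return out
```

With two levels this produces F(k) followed by F(G(k)), i.e. 2 × 736 = 1472 bytes.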

Say there's an attack A that's given the strings F(k_1),H(G(k_1)); F(k_2),H(G(k_2)); ...; F(k_U),H(G(k_U)) where k_1,k_2,...,k_U are independent uniform random keys. Can A distinguish these strings from uniform?

The A-distance from these strings to uniform is at most the sum of

the A-distance from (r_1,H(s_1); r_2,H(s_2); ...; r_U,H(s_U)) to uniform, where r_1,s_1,r_2,s_2,...,r_U,s_U are independent uniform random keys; and

the A-distance from (F(k_1),H(G(k_1)); F(k_2),H(G(k_2)); ...; F(k_U),H(G(k_U))) to (r_1,H(s_1); r_2,H(s_2); ...; r_U,H(s_U)).

The first distance is the same as the B-distance from (H(s_1),H(s_2),...,H(s_U)) to uniform, where B(x_1,x_2,...,x_U) is defined as follows: randomly generate r_1,r_2,...,r_U and then run A(r_1,x_1,r_2,x_2,...,r_U,x_U). This is the success probability of a U-key attack against H, with almost the same cost as A.

The second distance is the same as the C-distance from (F(k_1),G(k_1),F(k_2),G(k_2),...,F(k_U),G(k_U)) to uniform, where C(x_1,y_1,x_2,y_2,...,x_U,y_U) is defined as A(x_1,H(y_1),x_2,H(y_2),...,x_U,H(y_U)). This is the success probability of a U-key attack against the pair (F,G), again with almost the same cost as A.

Regarding constructivity, it's reasonable to assume that there's a small algorithm for H; then a small algorithm that quickly prints A is easily converted into a small algorithm that quickly prints B and C.

To summarize, the success chance of a U-key attack against F-and-then-H-keyed-by-G is at most the success chance of a U-key attack against (F,G) plus the success chance of a U-key attack against H. This makes perfect sense, since U keys for F-and-then-H-keyed-by-G involve exactly U keys for (F,G) and exactly U keys for H.
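The two derived attacks are just thin wrappers around A, which a short Python sketch can make visible. The interface is an assumption made for illustration: A takes a list of U pairs and returns its guess, and `make_B` and `make_C` are hypothetical names.

```python
import secrets

def make_B(A, f_len: int):
    """Build B from A: B(x_1,...,x_U) generates fresh uniform r_1,...,r_U
    (each f_len bytes, the length of an F output) and runs
    A(r_1,x_1,...,r_U,x_U)."""
    def B(xs):
        return A([(secrets.token_bytes(f_len), x) for x in xs])
    return B

def make_C(A, H):
    """Build C from A: C(x_1,y_1,...,x_U,y_U) = A(x_1,H(y_1),...,x_U,H(y_U))."""
    def C(pairs):
        return A([(x, H(y)) for (x, y) in pairs])
    return C
```

Feeding C the pairs (F(k_i),G(k_i)) hands A exactly the RNG strings F(k_i),H(G(k_i)), while feeding B the strings H(s_i) hands A the intermediate strings r_i,H(s_i). Each wrapper costs one extra pass over the U inputs, which is the "almost the same cost as A" in the text.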

I said at the beginning that I was considering only 1472 bytes of RNG output; but the generalization to H actually allows any number of bytes of RNG output. Take, for example, H to be F-and-then-H_2-keyed-by-G (or more generally F_2-and-then-H_2-keyed-by-G_2) so that the RNG outputs F(k) and F(G(k)) and H_2(G(G(k))). The success chance of an attack against this RNG is (by the theorem) at most the sum of success chances of U-key attacks against (F,G) and H; this is (by the theorem again) at most the sum of success chances of U-key attacks against (F,G), (F,G), and H_2. Repeat for any desired maximum number of blocks. The generalization to H also handles the forward-security scenario: simply define H(k) as (F(k),G(k)).

As a concrete example, the success probability of an attack against 2^30 users of the fast-key-erasure RNG, each generating a chain of 2^30 keys, is at most the sum of 2^30 chances of similarly efficient 2^30-key attacks distinguishing (AES_k(0),...,AES_k(47)) from uniform. The best 2^30-key attack we know against (AES_k(0),...,AES_k(47)) is to

guess as many keys as we can, say 2^80 keys (success probability approximately 2^110/2^256 since we're using AES-256); and, more importantly,

check for collisions (success probability approximately 1/2^87).

Unless we can come up with a better attack against AES, we can't beat success probability approximately 1/2^57 for an attack against the fast-key-erasure RNG. Replacing AES-256 with Salsa20 or ChaCha20 improves 1/2^57 to 2^140/2^256.
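The arithmetic behind these exponents can be redone in a few lines of Python. This is a rough check rather than a precise bound: it assumes the collision estimate b(b−1)/2^129 from earlier in the post with b = 48 blocks per key, and all variable names are hypothetical.

```python
from math import log2

BLOCKS = 48        # blocks generated per key
LOG2_USERS = 30    # 2^30 users, i.e. 2^30 keys per multi-key attack
LOG2_CHAIN = 30    # each user runs a chain of 2^30 keys
LOG2_GUESSES = 80  # attacker guesses 2^80 keys

# Collisions: about b(b-1)/2^129 for one key's 48 blocks, times 2^30 keys.
log2_collision_one_key = log2(BLOCKS * (BLOCKS - 1)) - 129
log2_collision_multi = log2_collision_one_key + LOG2_USERS

# Key guessing: 2^80 guesses against 2^30 targets in a 2^256 key space.
log2_guess_multi = LOG2_GUESSES + LOG2_USERS - 256

# Sum over the 2^30 levels of the chain.
log2_rng_with_collisions = log2_collision_multi + LOG2_CHAIN
log2_rng_guessing_only = log2_guess_multi + LOG2_CHAIN

print(f"collisions, 2^30 keys:   2^{log2_collision_multi:.2f}")
print(f"key guessing, 2^30 keys: 2^{log2_guess_multi}")
print(f"RNG bound with AES:      2^{log2_rng_with_collisions:.2f}")
print(f"RNG bound, guessing only: 2^{log2_rng_guessing_only}")
```

Under these assumptions the collision terms come out within a factor of two of the rounded 1/2^87 and 1/2^57 above, and the key-guessing terms are exactly 2^110/2^256 and 2^140/2^256.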

Of course this should be stated as a formal theorem with clear definitions, and the proof should go through careful review: remember what I said about errors in proofs. But this feels like an easy textbook exercise. How can there be hundreds of pages of papers on this topic?

Generalized rekeying

There's one noticeable way that the RNG situation is simpler than the more general rekeying situation analyzed in, e.g., the "cascade" paper and the NMAC/HMAC papers. But the same proof technique turns out to work in the more general situation too.

Let's say F is a function mapping a 256-bit key k and a "block" x to a 256-bit output F(k,x). For example, the RNG defines F(k,x) as (AES_k(x),AES_k(x+1)), and the set of blocks is {0,2,4,6,8,...,46}. Exactly one of these blocks, namely block 0, is used by the RNG to produce a derived key.

For generalized rekeying, the set of blocks can be much larger, including any number of blocks used to produce derived keys. This is where the RNG situation is simpler.

Let's write X for the set of blocks used to produce derived keys. For each x in X, the cipher output F(k,x) is used as a derived key for another function H with inputs in a set Y, and the outputs of H are given to the attacker upon request. For each block x that isn't in X, the cipher output F(k,x) is simply given to the attacker upon request. Formally, define T(k,x,y) = H(F(k,x),y) for x in X and y in Y; and define T(k,x) = F(k,x) for any block x outside X.
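In code, the definition of T is a two-way branch on whether the block is in X. Here's a minimal Python sketch; `make_T` is a hypothetical helper, and F, H, and X are passed in as parameters rather than fixed to any particular cipher.

```python
def make_T(F, H, X):
    """Build T from a cipher F, a followup function H, and the set X of
    blocks whose F outputs are used as derived keys."""
    def T(k, x, y=None):
        if x in X:
            # Derived-key block: F(k,x) is never revealed, only H keyed by it.
            if y is None:
                raise ValueError("blocks in X need an input y for H")
            return H(F(k, x), y)
        # Ordinary block: F(k,x) goes straight to the attacker.
        if y is not None:
            raise ValueError("blocks outside X take no y")
        return F(k, x)
    return T
```

For the RNG, X would be {0}, and H would be the RNG itself applied to the derived key.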

U users can have many more than U derived keys, since X can have many elements. It's convenient here to switch to a more general notation for describing multi-key attacks: instead of having a number U of users, let's have a set U of strings that label users. There are three targets for multi-key attacks:

Choose an independent uniform random 256-bit key K(u) for each u in U. Define F_K(u,x) as F(K(u),x) for each u in U and each block x. The goal of a multi-key attack against the cipher F is to distinguish this function F_K from uniform.

Define T_K(u,x,y) = T(K(u),x,y) for each u in U, each x in X, and each y in Y; and T_K(u,x) = T(K(u),x) for each u in U and each block x outside X. The goal of a multi-key attack against T is to distinguish this function T_K from uniform.

Choose an independent uniform random 256-bit key J(u,x) for each u in U and each x in X. Define H_J(u,x,y) = H(J(u,x),y) for each u in U, each x in X, and each y in Y. The goal of a multi-key attack against H is to distinguish this function H_J from uniform. There's an expanded set of users here, namely all pairs (u,x), but this is still an example of the same multi-key attack concept.

Intuitively, there are exactly two ways to attack T: find a pattern in the F outputs, or find a pattern in the H outputs. The proof will follow exactly this intuition.

Decomposing multi-key generalized rekeying attacks into multi-key cipher attacks

Here's the proof.

As above, choose an independent uniform random 256-bit key K(u) for each u in U, and choose an independent uniform random 256-bit key J(u,x) for each u in U and each x in X. Also define M as follows: choose an independent uniform random 256-bit string M(u,x) for each u in U and each block x outside X; and define M(u,x,y) = H(J(u,x),y) for each u in U, x in X, and y in Y.

Define B(V) as A(W), where W is defined as follows: W(u,x,y) = V(u,x,y) for each u in U, x in X, and y in Y; W(u,x) = M(u,x) for each u in U and each block x outside X. Then the B-distance from H_J to uniform is exactly the A-distance from M to uniform. Indeed:

If V = H_J then W(u,x,y) = H_J(u,x,y) = H(J(u,x),y) = M(u,x,y) and W(u,x) = M(u,x). Thus B(H_J) = B(V) = A(W) = A(M).

If V is uniform then W is uniform. Thus B(uniform) = B(V) = A(W) = A(uniform).

Define C(V) as A(W), where W is defined as follows: W(u,x,y) = H(V(u,x),y) for each u in U, x in X, and y in Y; W(u,x) = V(u,x) for each u in U and each block x outside X. Then the C-distance from F_K to uniform is exactly the A-distance from T_K to M. Indeed:

If V = F_K then W(u,x,y) = H(F_K(u,x),y) = H(F(K(u),x),y) = T(K(u),x,y) = T_K(u,x,y) and W(u,x) = F_K(u,x) = F(K(u),x) = T(K(u),x) = T_K(u,x). Thus C(F_K) = C(V) = A(W) = A(T_K).

If V is defined by V(u,x) = J(u,x) for x in X and V(u,x) = M(u,x) for x outside X, then W = M and V is uniform. Thus C(uniform) = C(V) = A(W) = A(M).

[2017.07.26 update: Corrected "B" typos in these two bullet items, which should have said "C". Remember what I said about errors in proofs?]

The A-distance from T_K to uniform is at most the A-distance from M to uniform plus the A-distance from T_K to M; i.e., the B-distance from H_J to uniform plus the C-distance from F_K to uniform. These are the success probabilities of multi-key attacks against H and F respectively, each attack having almost the same cost as A.
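Here's a minimal Python sketch of the two reductions, with oracles modeled as ordinary functions. The modeling choices are assumptions made for illustration: A takes an oracle W and returns its guess, oracles are called as W(u,x,y) for x in X and W(u,x) otherwise, the M values outside X are held in a table filled on demand, and the names `lazy_uniform`, `make_B`, and `make_C` are hypothetical.

```python
import secrets

KEY_BYTES = 32  # 256-bit strings

def lazy_uniform(n_bytes: int = KEY_BYTES):
    """Uniform random function sampled on demand and remembered in a table."""
    table = {}
    def f(*args):
        if args not in table:
            table[args] = secrets.token_bytes(n_bytes)
        return table[args]
    return f

def make_B(A, X):
    """B(V) = A(W): W forwards to V on blocks in X and answers with
    independent uniform strings M(u,x) on blocks outside X."""
    def B(V):
        M = lazy_uniform()
        def W(u, x, y=None):
            return V(u, x, y) if x in X else M(u, x)
        return A(W)
    return B

def make_C(A, H, X):
    """C(V) = A(W): W uses V(u,x) as a key for H on blocks in X and
    passes V(u,x) through unchanged on blocks outside X."""
    def C(V):
        def W(u, x, y=None):
            return H(V(u, x), y) if x in X else V(u, x)
        return A(W)
    return C
```

Running C on the multi-key cipher oracle makes W behave as the multi-key oracle for T, and running B on the multi-key oracle for H makes W behave as M, which is the content of the two "Indeed" arguments above.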

To avoid the cost of building and accessing an array of M values created on demand inside B, replace M with output from a high-security cipher (assuming one exists); this has negligible impact on the success probability of the attack. As for constructivity, again assume that there's a small algorithm for H; then a small algorithm that quickly prints A is easily converted into a small algorithm that quickly prints B and C.

That's it.

Version: This is version 2017.07.26 of the 20170723-random.html web page.