Date: Thu, 19 Dec 2013 06:08:53 +0400 From: Solar Designer <solar@...nwall.com> To: crypt-dev@...ts.openwall.com Subject: scrypt parallelism vs. area-time vs. TMTO resistance Hi, Setting scrypt's p > 1 introduces parallelism at (almost) the highest level. This has the advantage of needing to synchronize the threads just once (before the final PBKDF2), but it results in much lower area-time cost for the attacker than we could achieve by having the parallelism at a slightly lower level, inside SMix. Specifically, in SMix's first loop, we may have the threads each write to its own portion of V, and in SMix's second loop we may have them all read from the entire V. This requires that we synchronize the threads just one more time: between the two loops in SMix. For classic scrypt, the running time is 2*N (assuming that we have enough computing resources to make full use of the parallelism, and omitting r from this discussion for now). For the revision proposed above, the running time is only 2*N/p: the first loop fills all N elements of V in N/p time, and the second loop performs a total of N lookups (as usual) and accesses 63% of V elements (as usual) in N/p time. Now given the same p and the same target running time, we can set N to be p times higher. If for classic scrypt the area-time cost of attack is O(old_N^2 * p * r^y) [1], then for the proposed revision it is O(new_N^2 * r^y). If we maintain the same running time, then new_N = old_N * p, giving the new area-time cost O(old_N^2 * p^2 * r^y). This is a major improvement: we went from p to p^2 for a factor in the attacker's cost, whereas the defender's cost remained the same. While p > 1 is currently most relevant to (offline) uses of the KDF, we may also look at what happens on authentication servers doing password hashing and needing to achieve a certain throughput. With classic scrypt, going from p = 1 (with N and r tuned to achieve the desired throughput) to p > 1 (with N decreased accordingly) weakens the hashes (in area-time terms) by a factor of p. With the proposed revision, this (theoretically) does not affect security, so the decision may be made based on other criteria (including possibly providing lower latency for the users by making full use of the available computing resources even when the server is under less than the maximum supported load). This may be most relevant for possible defensive use of GPUs and such, where the amount of parallelism available from concurrent authentication attempts may be way below what's needed to optimally utilize the device, even on busy authentication servers (need tens of thousands of concurrent threads for current GPUs, and this keeps growing). Note that this revision to scrypt does not affect its TMTO friendliness (it remains just as TMTO friendly as the original, thereby allowing for efficient computation of the same derived keys or hashes e.g. on smaller mobile devices, if desired). Sounds great so far? It does to me. Unfortunately, when we're trying to discourage TMTO (which we should when we don't intend to use it ourselves, as it gives extra flexibility to the attacker), this revision prevents us from using some of the possible approaches to that. Specifically, it is incompatible with any writes to V in SMix's second loop. Some workarounds to that are possible - e.g., split SMix's second loop into several, write only to small portions of V at a time, read only from the remainder of the entire V, synchronize the threads whenever there's a change in which portions are read-write (and local to their respective threads) and which are read-only (and shared between the threads) - but nevertheless the TMTO-discouraging effect is reduced and complexity is increased. Another minor drawback is that while with classic scrypt p is reusable as a way to control running time without increasing memory needs (to be used on small devices, where we can nevertheless wait a long time for a derived key to be produced), in the proposed revision a new parameter would need to be introduced for that - preferably a way to increase SMix second loop's iteration count arbitrarily beyond N/p. Arguably, as pointed out by CodesInChaos, classic scrypt should have had a parameter like this anyway, because increasing it provides better security than abusing p, due to increasing the portion of time spent in SMix's second loop (N^2 cost to attacker) vs. first loop (N^1.5 cost to attacker). Here are some excerpts from my current implementation/hack, made as a revision of crypto_scrypt-ref.c (without actually making use of the parallelism yet, which I am leaving for the optimized implementations): /** * smix(B, r, N, p, flags, V, NROM, VROM, XY): * Compute B = SMix_r(B, N). The input B must be 128rp bytes in length; the * temporary storage V must be 128rN bytes in length; the temporary storage * XY must be 256r bytes in length. The value N must be a power of 2. */ static void smix(uint8_t * B, size_t r, uint64_t N, uint32_t p, escrypt_flags_t flags, uint8_t * V, uint64_t NROM, const uint8_t * VROM, uint8_t * XY) { size_t s = 128 * r; uint64_t Vchunk = 0, Nchunk = N / p, Nloop; uint32_t i; Nloop = 0; if (flags & ESCRYPT_RW) Nloop = Nchunk / p; for (i = 0; i < p - 1; i++) { smix1(&B[i * s], r, Nchunk, flags, &V[Vchunk * s], NROM, VROM, XY); smix2(&B[i * s], r, p2floor(Nchunk), Nloop, flags, &V[Vchunk * s], NROM, VROM, XY); Vchunk += Nchunk; } smix1(&B[i * s], r, N - Vchunk, flags, &V[Vchunk * s], NROM, VROM, XY); smix2(&B[i * s], r, p2floor(N - Vchunk), Nloop, flags, &V[Vchunk * s], NROM, VROM, XY); for (i = 0; i < p; i++) smix2(&B[i * s], r, N, Nchunk - Nloop, flags & ~ESCRYPT_RW, V, NROM, VROM, XY); } (Please disregard NROM and VROM for now, they're irrelevant to the topic at hand.) The functions smix1() and smix2() implement equivalents of the original SMix's first and second loop, respectively. This code is actually still able to compute classic scrypt (when invoked with the appropriate parameter values, including the flags), matching scrypt test vectors, as well as the revision discussed here (and more). ESCRYPT_RW requests read-write accesses to V to discourage TMTO. When this flag is not set (like it is not set when computing classic scrypt or when the only deviation from classic scrypt is the new approach to parallelization), Nloop is 0, so the first two smix2() calls seen in the code fragment above are no-ops. When ESCRYPT_RW is set, there's some additional magic here, which ensures there's no reduction in "absolute" TMTO resistance when going from p=1 to p>1 and increasing N proportionally. Unfortunately, there is reduction in "relative" TMTO resistance (the amount of memory an attacker has to provide for efficient TMTO does not increase), as well as if p is increased without a corresponding increase of N. Because of this added tradeoff (higher area-time vs. TMTO resistance), as well as to continue to offer computation of classic scrypt as an option in this source tree for now, there's a flag that requests the new approach to parallelization. Here's a comment I wrote on the flags: * Classic scrypt is available by setting flags to ESCRYPT_WORM [...] [...] In this mode, the thread-local memory region (RAM) * is first sequentially written to and then randomly read from. This * algorithm is friendly towards time-memory tradeoffs, available both to * defenders (albeit not in this implementation) and to attackers. * * Setting ESCRYPT_RW adds extra random reads and writes to the thread-local * memory region (RAM), which makes time-memory tradeoffs less efficient. * This may be used to slow down the kinds of attackers who would otherwise * benefit from classic scrypt's efficient time-memory tradeoff. * * ESCRYPT_PARALLEL_SMIX moves parallelism that is present with p > 1 to a * different level (as compared to where it is in classic scrypt), which * results in higher relative area-time cost for the attacker (given the same * running time for the defender) and changes the way the running time is to be * controlled from N*r*p (for classic scrypt) to N*r (in this modification). * In the ESCRYPT_WORM mode, there are no serious drawbacks to this setting * except that it breaks backwards compatibility with classic scrypt. Thus, if * you want time-memory tradeoff friendliness (for possible defensive use at a * later time, e.g. on a small mobile device), but don't need compatibility * with classic scrypt, then set ESCRYPT_PARALLEL_SMIX. In the ESCRYPT_RW * mode, which is meant to discourage time-memory tradeoffs, this new approach * to parallelization makes the time-memory tradeoffs less inefficient (this is * a very unfortunate side-effect of avoiding some random writes, as we have to * in order to allow for parallel threads to access a common memory region * without synchronization overhead). Thus, in this mode this setting poses a * tradeoff of its own (higher area-time cost vs. better TMTO resistance). * All of this applies only when p > 1. For p = 1, this setting is a no-op. I'd appreciate any feedback. Alexander [1] The scrypt paper suggests that y=2 (in Conjecture 1 on page 12), but given the attack on SMix's first loop (and similar) pointed out by CodesInChaos and the similarity between SMix's first loop and BlockMix, I think it may actually be 1.5. This may be more intuitive if (for the sake of this thought experiment) we assume large r, small N (e.g., N=2), and an attacker's implementation that does not actually store V elements (extreme TMTO). Luckily, whether y is 2 or 1.5 or something else, it remains the same for all revisions discussed above, so does not affect any conclusions here.

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.