Ideas for a New Elliptic Curve Cryptography Library June 12th, 2015

Why Create a New Elliptic Curve Cryptography Library?

Commonly used crypto libraries like NSS and OpenSSL implement ECDH and ECDSA signing and verification for a large number of prime and non-prime curves. In order to support a wide variety of curves, these libraries internally have frameworks that abstract the general algorithms from the curve-specific details. These frameworks make sense when a wide variety of curves and algorithms need to be supported, but they add significant complexity that would be unneeded for software that only needs to deal with a small number of well-known curves, such as the NIST P-256 and P-384 curves and Curve25519.

OpenSSL exposes its internal framework as a public API for applications to use, so changing the framework risks breaking applications built on top of it. This makes it difficult to make improvements to the framework. Consequently, the implementations of curve-specific optimizations in OpenSSL generally avoid the lower-level parts of that framework, causing those implementations to have very similar but separate implementations of the generic logic. Thus, for example, one set of optimized implementations of the P-224 , P-256 , and P-521 curves in OpenSSL has ~1,700, ~2,300, and ~2,100 lines, respectively, of quite similar code.

With today’s C compilers, some key components need to be implemented in assembly language to give reasonable performance and to prevent side channel leaks of secret information. Assemblers usually have macros and other high-level features that make assembly-language programming convenient. However, some assemblers do not have such features, and the ones that do all have different syntaxes. OpenSSL has its own assembly language macro framework called perlasm to deal with this. Every OpenSSL assembly language source file is actually a Perl program that generates the assembly language file. The result is several large files of interleaved Perl and assembly language code. The fastest P-256 implementation in OpenSSL, ecp_nistz256, consists of ~6,000 lines of x64 perlasm + ~1,700 lines of x86 perlasm + ~1,500 lines of ARMv8 perlasm + ~1,700 lines of ARMv4 perlasm + ~1,500 lines of C code plus a ~150KB table of data.

Imagine that we must find a high-performance implementation of P-256 and P-384 (the two curves generally used in TLS) ECDSA signature verification and verify its correctness. We might not find a fast enough P-384 implementation, so we might be tempted to adapt an existing fast P-256 implementation for P-384 . How long would the audit and adaptation process take? How confident would we be in the result? How hard would it be for somebody else to meaningfully confirm our findings and review any changes we make? It seems likely that we would take a lot of time, end up with something we could not be very confident in, and find the result to be hard to improve later.

The fundamental problem isn't the sheer amount of code in the optimized implementations of ECC or that the techniques they use are bad. Rather, after the simple ideas used in the implementations were translated into low-level C and assembly language code, with multiple layers of repetition, it is now difficult to map the code back to the abstractions it was derived from. A small number of reasonable trade-offs of the ease of understanding by people for the ease of processing by computers accumulated quickly into a large and unreasonable such trade-off.

The specialized implementations of ECC in OpenSSL are optimized for ECDH key agreement and ECDSA signing, so they use constant-time algorithms to avoid timing attacks. Hoewver, ECDSA signature verification does not need to be constant time, and the variable-time WNAF scalar elliptic curve point algorithm is significantly more efficient than constant-time algorithms. The constant-time ecp_nistz256 code in OpenSSL is much faster than OpenSSL’s variable-time P-256 code, but WNAF-based variable-time point multiplication implemented using ecp_nistz256’s implementation of P-256 field operations should be significantly faster (on average) yet.

The IETF is currently standardizing new ECC cryptosystems based on two new curves, Curve25519 and Ed448-Goldilocks. It seems likely that Curve25519 will be critically important for a variety of reasons; for example, it seems like it will be a key part of the bridge connecting the new Internet of Things (IoT) to the normal Internet of Nothings. If we were to implement Curve25519 and Ed448-Goldilocks using the same strategies that have been used for P-256 in the past, we would end up compounding the maintenance problem we already face.

Curve25519 and EdDSA (or whatever signature scheme ultimately gets standardized) will take a while to become ubiquitous. This makes it tempting to try to use P-256 and ECDSA as part of the bridge between IoT and the rest of the Internet in the short term. To have even a chance of succeeding at that, such an implementation would have to be very efficient in terms of execution time and code size. It would also have to be easy to create efficient ports of the code to highly-constrained (slow, low memory, and power-constrained) architectures.

Goals

With those motivations in mind, it is worth exploring whether it makes sense to create a new ECC implementation with the following goals:

We must be confident in the correctness of what we created, preferably with the assistance of automated tools for correctness checking.

In high-end smartphones, tablets, and computers, P-256 and P-384 signature verification should take one millisecond or less, so that all the signature verifications needed for a TLS handshake can complete in under 10ms, so that the CPU cost of the TLS handshake is not a factor in terms of latency. On more constrained devices, we will have to settle for “as efficient as the code size budget allows,” for now.

and signature verification should take one millisecond or less, so that all the signature verifications needed for a TLS handshake can complete in under 10ms, so that the CPU cost of the TLS handshake is not a factor in terms of latency. On more constrained devices, we will have to settle for “as efficient as the code size budget allows,” for now. The code and any precomputed static data should be as compact as possible. In constrained applications the allowed code size is a parameter that can vary considerably, so it is hard to put a specific figure on this. However, informally we will aim to create something that is smaller than the ECC portions of other implementations like BoringSSL, OpenSSL, and NSS.

It should be straightforward to reuse large amounts of the implementation of P-256 and P-384 to implement Curve25519 and maybe other curves.

Without significantly sacrificing signature verification performance, the code for the arithmetic should be easy to reuse (preferably) or adapt (less preferably) for side-channel resistant implementation of signing and ECDH key agreement.

An Implementation Strategy

ECDSA signature verification consists of parsing inputs, doing an inversion and some multiplications modulo n (the group order), doing two scalar elliptic curve point multiplications, and then verifying that the result of the multiplication matches a value in the signature. We precompute a bunch of sums of the curve’s group generator point G and a bunch of sums of the public key, which requires multiplication, squaring, and inversion modulo q (the field modulus). We compute the WNAF representations of the scalars of the two points, and then perform a pattern of point doublings and point additions based on the pattern in the WNAF representation. The point doublings and additions are specific sequences of multiplications, additions, subtractions, and squarings modulo q . Multiplication and squaring modulo q are, by far, the most performance-sensitive operations because they are executed many times in a scalar point multiplication. In order to do all these arithmetic operations, we generally need to precompute some quantities that are functions of n or q and the word size of the machine. ECDSA signing and ECDH key agreement more-or-less involve the same kinds of operations, except they need everything to be constant time.

One way to minimize code size, at the cost of speed, is to implement all the algorithms generically. For example, there would be one modular multiplication function that takes two multiplicands and the modulus as inputs, one modular inversion function that takes the element to invert and the modulus as input, etc. We would probably find that such generic implementations work good enough in many cases, especially the operations modulo n that are not frequently executed, so that we can avoid writing curve-specific code for many of them. If we start with generic implementations of every algorithm, then we can use them as the foundation for a data-driven code-profile-optimize development loop that ensures that we are always optimizing the most impactful part of the implementation, without guessing. Having the generic implementations of everything available would also give us flexibility in making size vs. speed tradeoffs later in the development process.

Besides trying to share code across curves, we could also minimize object code size by moving work from runtime to build time. For example, instead of parsing the string representation of q into an array of limbs at runtime, we parse the value of q during the build and generate a static const array definition that contains the array. This way, we save not only the cost of the function call to the parsing function, but we can also avoid the cost of the code for the parsing function. More generally, we should try to do everything that can be done at build time at build time.

Also, we should avoid doing things manually that can be done automatically by the build system. For example, in order to do Montgomery multiplication, we have to calculate and save the values the constants k and r 2. Rather than calculate the values manually and hard-coding them into the program or code generator, the code generator should calculate these values from the basic inputs. This would end up having a much better documentation effect than hard-coding the values because it is guaranteed that the formulas and the results match. Further, adding this small extra complexity to the code generator would make it much easier to expand the implementation to support additional curves and platforms.

An ECC Compiler

We can take the ideas in the preceding section further and think about the code generator as a compiler for an ECC domain-specific language (DSL). The terms of the language would be field elements and scalars, its statements would be a series of assignments of the results of field operations (additions, subtractions, multiplications, and squarings), and its runtime library would consist of the curve-agnostic implementations of the various algorithms plus any curve-specific functions that we hand code.

There are already several programs written in this language in the Explicit-Formulas Database. For example, here is a program from the Explicit-Formulas Database which, given a point (X1, Y1, Z1) in Jacobian coordinates, calculates the double of that point (X3, Y3, Z3), alongside a very simple translation to a C function that calls other functions that perform the actual field operations:

delta = Z1^2 gamma = Y1^2 beta = X1*gamma t0 = X1-delta t1 = X1+delta t2 = t0*t1 alpha = 3*t2 t3 = alpha^2 t4 = 8*beta X3 = t3-t4 t5 = Y1+Z1 t6 = t5^2 t7 = t6-gamma Z3 = t7-delta t8 = 4*beta t9 = t8-X3 t10 = gamma^2 t11 = 8*t10 t12 = alpha*t9 Y3 = t12-t11 /* impl_point_double doubles the point |a|, * storing the result in |r|. */ static void impl_point_double( JacobianPoint *r, const JacobianPoint *a) { /* Variable declarations and handling * of points at infinity elided. */ impl_sqr_mont_q(delta, a->z.limbs); impl_sqr_mont_q(gamma, a->y.limbs); impl_mul_mont_q(beta, a->x.limbs, gamma); impl_sub_mod_q(t0, a->x.limbs, delta); impl_add_mod_q(t1, a->x.limbs, delta); impl_mul_mont_q(t2, t0, t1); impl_mul_3_mont_q(alpha, t2); impl_sqr_mont_q(t3, alpha); impl_mul_8_mont_q(t4, beta); impl_sub_mod_q(r->x.limbs, t3, t4); impl_add_mod_q(t5, a->y.limbs, a->z.limbs); impl_sqr_mont_q(t6, t5); impl_sub_mod_q(t7, t6, gamma); impl_sub_mod_q(r->z.limbs, t7, delta); impl_mul_4_mont_q(t8, beta); impl_sub_mod_q(t9, t8, r->x.limbs); impl_sqr_mont_q(t10, gamma); impl_mul_8_mont_q(t11, t10); impl_mul_mont_q(t12, alpha, t9); impl_sub_mod_q(r->y.limbs, t12, t11); }

By implementing a compiler, all the classic optimization techniques used in compilers would be available to us. For example, notice in the C code above that all the multiplications by constants were translated to calls of functions that are optimized for multiplication by that constant; this is a basic form of classic strength reduction. As another example, the compiler could reorder the statements in order to optimize the program for the out-of-order execution behavior of a particular processor. In particular, the compiler could put effort into ensuring, whenever possible, that a statement that stores a value into a variable x never directly precedes a statement that reads the value of x . Or, if our compiler were to generate assembly language code, then we might implement a custom calling convention where the result of a calculation is always stored in a specific set of registers and where the inputs to a function are always stored in those same registers, to avoid otherwise-unnecessary load/stores. Our compiler might even transform the program into continuation passing style (replacing ret + call with jmp ) or do other things that would be easy to for a compiler do, but would be error-prone to do (or verify) correctly by hand.

More interestingly, our compiler could go beyond classic compiler techniques. It seems hard to justify the amount of effort that would go into building a sophisticated optimizing compiler for our language. Instead, we might implement a dumb randomized compiler that, given a seed value, configures flags that control various simple optimizations like reordering and various attempts at strength reduction. Then we could operate our compiler in two modes: profiling mode and production build mode. The profiling mode would generate a random seed, generate the code, compile the code, run a program that measures the performance of the code, and compares the execution time to the previously-known best execution time. Then we would just let it run until we run out of development time, take the best seed the program output, update the build system so that the new seed is used for production builds, and say we’re done. We could call our ridiculous compiler the “super dumb compiler” and hope that people confuse it for a superoptimizing compiler. To scale this incredibly inefficient optimization strategy, we could mint a new cryptocurrency, Superdumbcompilercoin, where we trick people into running the super dumb compiler on their computers, cross-certifying each others’ results, and sending us increasingly better optimization seeds in return for plausible delusions of becoming rich.

Besides generic compiler and superoptimization techniques that apply to all languages, the compiler could also assist in the implementation of ECC-specific optimizations. For example, it could discover, or at least assist in the implementation of, cases where we can coalesce reductions modulo q after each field operation. As another example, the compiler could choose completely different formulas based on the cost of multiplications vs. the cost of squarings and/or the cost of multiplications vs. the cost of inversions.

The more complex the compiler becomes, the more important it would be to automate the verification of the validity of the optimizations. We could do this by splitting our compiler into two parts: the optimizer and the back-end. The optimizer would take programs in the three-operand statement language as input, and output a optimized form of the program in a similar, lower-level, intermediate language. The back-end would then transform this intermediate language into assembly language (and/or C) code. If we design the intermediate language carefully, we should be able to make it easy for proof checkers or computer algebra systems to process it and verify that the intermediate program is correct, much more easily than if we were to try to formally prove that equivalent C or assembly language code is correct. It would be possible that the back-end could have a flaw that causes it to generate bad code even for good inputs, but we can minimize the risk of that happening by keeping our back-end simple; eventually we could try to formally prove the compiler’s correctness.

Optimized Implementation of Field Operations

The implementations of the P-256 field operations in OpenSSL’s ecp_nistz256 are very impressive. They are very fast and are based on smart ideas. However, I have found the code—11,000 lines of Perl mixed with assembly language—very difficult to audit. Although it is a lot of code, it consists of a few basic constructs repeated many times. It would be great to find a way to use implement the same techniques as the ecp_nistz256 code, but in a way that is adaptable to other curves and that is easier to understand.

We could implement another code generator to generate optimized implementations of field operations from more abstract, but still low-level, expressions of them, like qhasm does. However, it seems likely that we will find that optimal implementations of field operations for different architectures and sub-architectures are going to be so different that it would not be worthwhile to find a way to abstract away architecture-specific bits into portable code. Instead, we might pick a single cross-platform assembler, like Yasm, and try to express the abstract patterns primarily using its macro language, hoping that the consistency of using the same macro language would make it easier to maintain the various platform-specific implementations. Assembler macros work quite similarly to how C++’s templates work (particularly how C++ templates worked before C++11) and so are surprisingly powerful. It seems like we would need to make at least a few implementations in “native” assembler before we could understand the costs and benefits of such abstractions.

There are aspects of the implementation of field operations that cannot be solved using a macro assembler in a way that is easy to understand. For example, when implementing Montgomery multiplication we need to multiply a single limb by the field modulus, and the instruction sequence—not just the operands of the operations, but the sequence of operations itself—depends on the value of the modulus. Again, it would not be clear how automating the generation of such code would help us until a few implementations have been made by hand.

One of the major design issues is the way the large integers are split into limbs. For example, should a 32-bit implementation use full 32-bit limbs to minimize the number of multiplications, partially-filled 29-bit limbs to minimize adds-with-carry, or something else? I suspect it will be best to use full limbs in most constrained implementations because the cost of multiplication will be high relative to addition-with-carry, and they do not try to be so clever with respect to out-of-order execution. Again, it is hard to say without implementing some prototypes.

Dynamic Compilation and Self-Modifying Code

On x86, the most efficient way to implement modular multiplication depends on whether the processor implements SSE2, MULX, and/or the ADCX/ADOX instruction set extensions and/or the processor family/generation (Atom vs. Core i7, Sandy Bridge vs. Haswell, etc.). Similarly on ARM the best implementation depends on what instructions (NEON, in particular) are available and whether the chip was designed for in-order or out-of-order execution. Shipping a binary with every curve optimized in each possible way would result in a lot of bloat. It may be better, instead, to ship the compiler as part of the library, and then have the compiler generate just the one implementation of each curve that is optimized for the processor that it is running on. It would likely be hard to make the compiler small enough for this to be a win, though.

On a constrained device, one could generate the executable code for a curve, run the code, and then throw the generated code away to make room for other code (e.g. the generated implementation of another curve). Or, perhaps we could ensure that the code generated for each different curve is similar enough that we could implement an efficient patching mechanism that would allow us to patch the implementation of one curve to morph it into the implementation of another curve, and then use another patch to get back to the implementation of the initial curve. Of course, this would be complete crazy unless it were automated by the compiler (and maybe even then).

Smart Caching of Intermediate Values

Although the cost/benefit ratio of a runtime compiler is questionable, there are many more straightforward optimizations we can do, based on the same idea of being flexible on how we distribute work between the compiler and the runtime system. Normally we need to precompute a bunch of additions of the curve’s group generator. If we were to do that at build time, then we would bloat the size of the executable by a significant amount. (In ecp_nistz256, the size of the static table is about 150KB; in libsecp256k1 that table would be over a megabyte.) If we were to do that at runtime, then we would need to do the precomputation at a time where we can afford to do compute a multiplicative inverse modulo q , which is quite expensive. Instead of doing either one of those, we could precompute just that inverse at build time and then do the rest of the precomputation at runtime.

A similar consideration applies to the precomputation of points related to public keys; if we have a public key cache, we should consider having that cache store the result of the inversion that is needed for the precomputation.

More generally, higher-level code that makes use of ECC primitives is often not very efficient about caching intermediate results. For example, when verifying a TLS server certificate chain, it is common to need to use each P-384 certificate public key twice: once to verify the signature of a certificate, and again to verify the signature on an OCSP response for that certificate. It may be quite worthwhile to cache all the precomputed points for the last few public keys used during certificate validation. In particular, if we know that we are going to use a given public key twice then we should precompute more points than if we expect to only use the public key once.

The Dangers of Error Checking

When verifying an ECDSA signature, we have to validate the inputs are within acceptable ranges. We could do this range checking when we parse them, or we could do the range checking before the computations that depend on the values being within range, or we could implement redundant checks. It is natural to expect that more error checking is better, but OpenSSL demonstrates that bugs frequently lie in error handling code paths. Also, when the same error condition is checked multiple times, the first check usually makes the redundant checks dead code, so we would have to leave the code implementing the redundant checks untested or we would have to jump through hoops to write tests that reach the secondary checks.

The special case of failure to allocate memory (e.g. malloc returning NULL ) is particularly terrible, because this kind of edge case is almost never thoroughly tested. Frequently developers on open source projects have significant apathy for handling and testing these cases because they have configured their own applications to abort on memory allocation failure so any bugs in the error handling will not affect them. This apathy leaves everybody else that does not use the same strategy at risk.

It seems it would be better, then, to do all range checking, memory allocation, and other possibly-failing actions as early as possible, before we start calculating any results. Then all of our low-level calculation functions would become infallible. And then our mid-level calculation functions would become infallible, since they would not have to worry about error cases from the low-level functions. This pattern would result in a recursive simplification up to the top-level functions. It should also make the compiler simpler because the compiler would not have to generate any code to do error checking, and it should make our correctness proving tasks easier since there would be many fewer code paths to prove correct.

Since all the low-level functions would have important preconditions that they do not check, it would be critically important that those preconditions be clearly documented. The C assert function could be used for this. assert ’s primary advantage as a documnetation tool is that it can be machined checked to a certain extent so that it is less likely to get out of sync with the code, especially with good test coverage.

What about the Algorithms?

It seems silly to build an implementation that uses a bunch of crazy optimization techniques and hard-coded assembly-language to implement obsolete algorithms. However, a little experimentation has shown that ecp_nistz256 is implemented so efficiently that it beats pretty much any implementation of better algorithms that is not very carefully coded. Choosing and implementing better algorithms is much easier and has less risk of failure than the process of figuring out all the optimization techniques that would be needed regardless of what algorithms are chosen. The advantage of implementing the low-level optimizations first is that we could benchmark against ecp_nistz256 in an apples-vs-apples manner, so that we could measure the benefits of the low-level optimizations distinctly from the benefits of the improved algorithms. The compiler would help in defering the choice of algorithms until later, because it would make it make it easy to switch from commonly-used algorithms to new ones.

In any case, a description of the best algorithms to use, and why they are the best, is worth considering seperately. For now the main points to keep in mind with respect to algorithm choices are: (1) There are algorithms that are not commonly used that are seem to be significantly better than the ones used in the most commonly-used open source software, and (2) in order to realize the gains from those algorithmic improvements, the implementations of the better algorithms have to be very good.

Conclusion

It is plausible that a DSL-based approach to implementing an ECC library would have significant advantages with respect to attaining strong assurance of correctness, ease of maintenance, and optimal performance. This is particularly likely to be important for constrained platforms where are limited in our ability to trade space for time and where we cannot rely on C compilers to generate efficient enough code. It seems worth trying to build a prototype to see if such an approach makes sense.

Recommended Reading