Update, Mar 6, 2020: While research continues into the viability of Nano PoW and the potential application of other algorithms, including Equihash, we have announced plans to optimize the tuning of the current algorithm for Athena V21.

Nano PoW is an authentication-free rate-limiting algorithm, commonly referred to as a proof-of-work. Rate limiting after a counterparty has been authenticated is trivial — this is commonly seen in login-attempt limiters and user quotas for bandwidth, storage, or time. It is, however, much more difficult to rate limit in a situation where authentication has not yet been, or cannot be, performed.

The first uses of proof-of-work as an automated rate limiter were in email spam prevention tools. With email, large amounts of unsolicited email can be sent at very low cost and can easily inundate an individual’s inbox with undesired communications. An obvious option is to use an email sender address whitelist, though this is cumbersome for a user to maintain, and users typically still want to receive a small amount of legitimate unsolicited mail. In place of this whitelist, it was proposed that the sender attach a small piece of data called a proof. This proof would mathematically demonstrate that the sender dedicated more than a trivial amount of resources to make this transmission.

Following a similar pattern to early proposals for email, the Nano network has required proofs on every transaction to date using a blake2b-based algorithm. Although the original design of requiring a proof is still useful, the existing algorithm proved to be too easily parallelized and needed a new approach to serve its intended function. The design of this new Nano PoW algorithm is outlined in detail below and although it is useful in a broader range of applications, the first implementation will be done on the Nano network.

Design goals

Authentication-free: A proof must be able to be constructed in the absence of authentication.

Minimize proof size: Proof should be as small as possible: 8–16 bytes, in order to minimize space overhead when presenting the proof.

Minimize proof verification resources: Verification should take minimal computing time and resources.

Simple specification: Simpler solutions are easier to analyze and implement.

Maximize generation time-area-product: The algorithm should require a certain amount of time and hardware resources to be present in order to be efficiently solved.

Minimize power consumption: The method of proving should use as little power as possible.

Algorithm details

The proof-of-work is a two-number solution (x, y), each of N bits, satisfying the equation H0(x) + H1(y) = 0 mod 2ᴰ, where D is a tuned difficulty parameter and H0, H1 are preimage-resistant hash functions keyed by a problem nonce P.

Nano PoW equation that valid proofs must satisfy

As an example, if H0 and H1 produce 8-bit outputs and we choose a difficulty D=4, then the low four bits of H0(x) + H1(y) must all be zero.

Hash Functions H0 and H1: Two different hash functions are chosen in order to eliminate solution-finding optimizations, which we elaborate on later. These hash functions can be constructed from a single algorithm family simply by initializing the hash function with different keys. For instance, H0 can be keyed by always setting a fixed bit in P and H1 can be keyed by clearing the same bit.
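The keying scheme above can be sketched as follows. This is a toy illustration, not the specified construction: blake2b (available in the Python standard library) stands in for the hash family, and the choice of which bit of P to set or clear is hypothetical.

```python
import hashlib

N_BITS = 8  # toy output width, matching the 8-bit example above

def keyed_hash(key: bytes, value: int) -> int:
    # blake2b is a stdlib stand-in; the reference implementation uses siphash
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % (1 << N_BITS)

def h0(p: bytes, x: int) -> int:
    # H0: key is P with its lowest bit forced to 1 (illustrative bit choice)
    return keyed_hash(bytes([p[0] | 1]) + p[1:], x)

def h1(p: bytes, y: int) -> int:
    # H1: key is P with the same bit cleared
    return keyed_hash(bytes([p[0] & 0xFE]) + p[1:], y)

P = bytes(16)  # hypothetical 16-byte problem nonce
```

Because the two keys differ in at least one bit, H0 and H1 behave as independent random functions even though only one hash algorithm is implemented.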

Plus operator: An invertible operator, also chosen to eliminate solution-finding optimizations.

0 mod 2ᴰ: Higher values of D mean higher difficulty; D is tuned to the difficulty threshold required by the user.
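Verification of a proof is just two hash evaluations, an addition, and a mask, which is what makes the verification-cost goal achievable. A minimal sketch, again using blake2b as a stand-in for the keyed hash family and a hypothetical keying byte:

```python
import hashlib

D = 4  # difficulty from the 8-bit example; real deployments use much larger D

def h(key: bytes, value: int) -> int:
    # blake2b stands in for the keyed hash family (the reference code uses siphash)
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")

def verify(p: bytes, x: int, y: int, d: int = D) -> bool:
    # Two hash evaluations, one addition, one mask: constant time and memory
    h0 = h(p + b"\x00", x)  # H0: family keyed by P plus a distinguishing byte
    h1 = h(p + b"\x01", y)  # H1: same family, different key
    return (h0 + h1) % (1 << d) == 0
```

Note the asymmetry: generating (x, y) requires substantial search effort, while checking it costs a verifier almost nothing.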

Solution finding methods

The naive way to find a solution is simply by trying random values for x and y. A processor picks a value for x and iterates y until a solution is found. This can be viewed as storing 1 candidate x value and each time we attempt a y value, we’re checking this attempt against a single candidate.
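The naive method can be sketched in a few lines. Difficulty, key derivation, and the blake2b stand-in hash are illustrative assumptions, not the production parameters:

```python
import hashlib
import itertools

D = 8  # toy difficulty so the loop finishes quickly

def h(key: bytes, value: int) -> int:
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")

def solve_naive(p: bytes, d: int = D):
    x = 0  # the single stored candidate
    for y in itertools.count():
        # each y attempt is checked against exactly one x candidate
        if (h(p + b"\x00", x) + h(p + b"\x01", y)) % (1 << d) == 0:
            return x, y

x, y = solve_naive(b"nonce")  # expect on the order of 2^D attempts
```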

An improvement on this can be made by storing more than one x candidate and comparing a y attempt against all stored x candidates. The key to this optimization lies in the ability to compare a single y attempt against all stored x candidates in constant time. We’re able to do this by radix-sorting x candidates into buckets according to the least significant bits of H0(x). When making a y attempt, we can calculate the unique bucket which might contain a solution by rewriting the solution equation as H0(x) = 0 − H1(y) mod 2ᴰ: the only place a candidate solution can be is the bucket LSB(0 − H1(y)).

If the table is filled with M values, we’re able to compare a single y attempt against M x candidates with a single, constant time computation and a memory lookup. This is the lookup table’s M-factor.

Finding a solution in optimal time involves finding a balance between populating the table with candidates, and making attempts.

Filling the lookup table

Since H0, H1 are required to be preimage-resistant functions, there’s no way to pick and fill a particular bucket entry — we must fill the table randomly and stop at a statistically optimal count. As the table load factor increases, the collision chance increases as well and additional filling has rapidly diminishing returns.

Searching the lookup table

The solution finding algorithm must be designed in a way to deal with uninitialized data since it is not required, nor desired from a performance perspective, to fill the entire table. We continue checking new attempts against stored candidates until a solution is found.

The optimal solution-finding method consists of two steps: randomly filling a lookup table, and efficiently searching it for potential solutions.
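The two phases can be sketched together. As before, blake2b and the keying bytes are stand-in assumptions; the toy table holds roughly 2^(D/2) entries, and bucket collisions during filling simply overwrite, which the algorithm is designed to tolerate:

```python
import hashlib
import itertools

D = 12                      # toy difficulty
TABLE_BITS = D // 2         # ~2^(D/2) entries
TABLE_SIZE = 1 << TABLE_BITS

def h(key: bytes, value: int) -> int:
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")

def solve(p: bytes, d: int = D):
    mask = (1 << d) - 1
    k0, k1 = p + b"\x00", p + b"\x01"
    # Phase 1: fill. Each x lands in the bucket named by the low bits of H0(x);
    # a collision overwrites the previous occupant, which is harmless here.
    table = [None] * TABLE_SIZE
    for x in range(TABLE_SIZE):
        table[h(k0, x) % TABLE_SIZE] = x
    # Phase 2: search. Rewriting the equation as H0(x) = 0 - H1(y) mod 2^D,
    # each y attempt names the single bucket that could hold a solution.
    for y in itertools.count():
        needed = (-h(k1, y)) & mask
        x = table[needed % TABLE_SIZE]
        if x is not None and (h(k0, x) + h(k1, y)) & mask == 0:
            return x, y

x, y = solve(b"nonce")
```

The `if x is not None` guard is what makes uninitialized buckets safe: a miss or junk entry simply fails the full check and the search continues.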

Sizing the lookup table

The optimal size of the lookup table is a function of the problem difficulty, factoring in the diminishing returns from a high table load factor. Each fill requires an H0 computation and a memory write. Each attempt requires one H1 hash, a memory read, and another H0 hash in order to reconstitute the full hash of the value retrieved. We’ve found that a lookup table of approximately 2^(D/2) entries gives the optimal solution finding time.

M-factor and fill count ratios for a given table size

Design considerations

Simple hash algorithm: We want to maximize the memory-access overhead relative to the compute overhead. Therefore, we are looking for the simplest hash function that both produces uniformly random output and is also preimage resistant. Siphash is extremely easy to implement and provides both of the guarantees we need.

Lookup versus hash table: Since the key being stored is already uniformly random, there’s no need to calculate an additional hashcode for the key; we can use the value itself. And since we can reconstitute the full key from a value in a particular bucket, we don’t need to store the full key at all.

High load factor: To make the best use of allocated memory we want to maximize the table’s load factor, while taking into consideration the diminishing returns from filling as collisions increase.

Junk and data-race immune: Since there’s no guarantee that any particular bucket will contain an initialized value, the algorithm is designed to proceed even in the face of junk data, data races, or data glitches. This means the entire algorithm can proceed without thread synchronization or atomic memory operations in any phase.

Parallel Rho searching

There are similarities between the Nano PoW algorithm and finding a hash collision. Hash collisions of lengths shorter than what’s cryptographically secure can be found using very limited memory with Rho searching¹. We add two parts to the solution equation, each of which independently prevents solutions from being found by Rho searching.

Collision finding using Rho works as follows: the output of a hash function is fed back into its input, so each starting value traces a path of successive outputs. Rho searching relies on the paths from two different starting values eventually merging into a single path when their outputs happen to be identical; once two paths merge they never diverge. Eventually the merged path reaches a distinguished point, at which point the algorithm has found a hash collision between the current and previous distinguished points.

The first thing we add is using two different hash functions on either side of the operator. When we use two different hash functions, even if a collision was found between the two values, they would not proceed to follow identical paths and thus the collision can’t be detected by distinguished points.

The second thing we add is using the addition operator instead of an equality comparison. Rho searching uses path merging as a signal that the solution was found and path merging only works when looking for value equality. If we use addition instead of equality comparison, merged paths are not an indication of a solution and thus Rho cannot be applied.

Scaling by trailing zeros versus operand count

Our chosen method of increasing problem difficulty is to require a solution to have more trailing zeros. More trailing zeros require more attempts and thus longer solution bit-strings. We did consider an alternative way of scaling up the problem difficulty: requiring additional operands to be added together. The relevant quantity to compare is how much additional memory is required per additional bit of solution size.

For approximately every two trailing zero bits, our memory requirements double in size. This means that if we double the number of trailing zeros required, we square the amount of memory required. In contrast, if we double the number of operands required, we only double the amount of memory required.
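The comparison can be checked with a line of arithmetic, assuming the ~2^(D/2) table sizing discussed earlier:

```python
# Table size as a function of difficulty, assuming ~2^(D/2) entries
def table_entries(d: int) -> int:
    return 1 << (d // 2)

# Doubling the trailing-zero count squares the memory requirement,
# whereas doubling an operand count would only double it.
assert table_entries(60) == table_entries(30) ** 2
```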

Increasing problem difficulty by requiring additional trailing zeros scales the memory requirements quadratically with respect to solution length, however, increasing problem difficulty by requiring additional operands only scales linearly.

Memory versus logic elements

Both memory and compute elements represent a physical quantity associated with discovering the solution. Memory elements, however, are more power-efficient than compute elements. The majority of power dissipated in digital logic is from switching losses as logic elements transition between their high and low states. Memory gates hold their values steady and don’t dissipate as much power per element. Memory is highly commoditized, even between competing processor implementations and is the lowest $/transistor on the market.²

No data relation between items in the lookup table

All items in the lookup table are outputs of a pseudorandom function and have no determinable data-relation to each other. This property eliminates the possibility of time-memory tradeoffs similar to scrypt³, where data storage can be traded off for computation based on data relations within the table. Additionally, because there is no spatial or temporal relationship between candidate solutions in the lookup table and attempts, data caching will have a low hit rate.

Observed performance

The table below lists the performance, as mean time to generate and validate a proof, for selected CPUs and GPUs. Devices use the optimal number of threads (T), and 2MB pages (where possible). All numbers were obtained at difficulty 60. The memory used refers to the size of the lookup table — 4GB corresponds to a table of 2³⁰ entries (“lookup 30”), 8GB to 2³¹ (“lookup 31”). Results are blank where the required memory exceeds the available memory. Validations are only performed by CPUs.

Nano network implementation

The first implementation of the Nano PoW algorithm has been made public for review and feedback in the following GitHub repository:

Plans for adding this implementation to the Nano node with V20 Lydia are underway. If you are interested in helping test performance and provide feedback on Nano PoW please join us at https://chat.nano.org in the #pow channel.

To stay updated with the latest on these efforts, make sure to follow the GitHub repository above, the Nano Medium Blog and join us on our social channels.