I concur with most commenters on this thread that ASIC mining hurts decentralisation. Unless, by some miracle, somewhere other than China becomes profitable for fabbing silicon, it's doubly dangerous because of how viciously the Chinese government fights against decentralisation and open communication.

Several people mentioned Cuckoo Cycle, and I find it quite incredible how poorly informed most people are about it. Cuckoo Cycle was first devised TWO years ago. There are now CUDA and OpenCL implementations, and it is specifically targeted at rescuing the GPU mining community from ASICs. In my opinion, though, Tromp's specific implementation leaves a lot to be desired in terms of flexibility: the network cannot adapt, without a hard fork, to new parameters that could make ASICs useless without also obsoleting much of the GPU hardware deployed on the network.

Cuckoo Cycle's 42-node cycle parameter is aimed at a memory utilisation target of 4 GB, which puts it on the margin where ASIC devices would have trouble competing with commodity hardware. In my opinion it should now be targeting 8 GB, because most mining hardware, anything bought in the last 12 months at least, already has 8 GB of memory. Equihash's ~600 MB memory utilisation is a terrible waste of memory resources in light of this, and Cuckoo Cycle definitely appears to target the previous generation of video cards.

I may be looking too far into the future here, but in my opinion the only way to stop ASICs from ever creeping up on us is to lock the solver algorithm to the CPU architecture, and this requires designing the algorithm to take advantage of what a CPU has and a GPU does not. Two things:

Low-cost 16-64 GB memory, and large on-die caches of 6 MB and up. Admittedly, an interesting and maybe ironic thing about this is that AMD hardware has the clear edge here: Ryzen 7 processors have 16 MB of cache versus Intel's 6-8 MB, and AMD's HyperTransport runs at 2-4x the speed of Intel's memory bus system.

But capability can scale even further than this. More and more systems now also have SSDs on PCI Express buses that eclipse the performance of SDRAM from circa 5 years ago. Mobile devices like smartphones and tablets now frequently sport a 4 MB CPU cache, at least 3 GB of memory, and at least 16 GB of eMMC storage.

The ultimate solution for ASIC resistance is a solver algorithm designed to leverage this large on-die cache, minimising the amount of random memory access needed to find solutions in the finite fields generated from the random nonces.

I have been working on exactly such a search-and-sort algorithm, which I am calling BAST, an acronym for Bifurcation Array Search Tree.

This algorithm is, as far as I can tell from my research, the very first dense binary tree implementation. Say we have a variant of Cuckoo Cycle that works on a 128-bit hash. Instead of bit-fiddling coordinates whose bit-widths aren't divisible by 8, we use adjacency lists: we simply cut the hash in two, and each half represents a coordinate, the head and tail of a vector. For an array consisting of 32 rows, the theoretical total space requirement would be 8 GB of memory for a completely filled 32-bit indexed store, or 16 GB for both halves of the coordinate pairs together.
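To make the coordinate scheme concrete, here is a minimal C sketch of splitting a 128-bit hash into the two 64-bit endpoints of an edge. The names and the little-endian byte order are my own illustrative assumptions, not from any existing codebase:

```c
#include <stdint.h>

/* Illustrative only: split a 128-bit hash into the head and tail
 * coordinates of an edge, as described above. The struct and function
 * names are hypothetical; little-endian assembly is an assumption. */
typedef struct {
    uint64_t head;  /* first 64 bits of the hash */
    uint64_t tail;  /* last 64 bits of the hash  */
} edge_t;

static edge_t split_hash(const uint8_t hash[16]) {
    edge_t e = { 0, 0 };
    for (int i = 0; i < 8; i++) {
        e.head |= (uint64_t)hash[i]     << (8 * i);
        e.tail |= (uint64_t)hash[i + 8] << (8 * i);
    }
    return e;
}
```

Because each coordinate is a whole number of bytes, no odd-width bit-fiddling is needed at all.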

The big advantage of a dense storage representation shows up most clearly in searches. A binary tree naturally steps downwards and left/right, splitting the remaining range in two at each step. In a dense implementation of the store, each step down moves linearly through memory, with the distance travelled approximately doubling at each step.
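As a rough illustration of that access pattern, here is a minimal sketch of a generic dense, array-backed tree search, not BAST itself: the children of the node at index i live at 2i+1 and 2i+2, so each step down roughly doubles the index, and the walk moves forward through the array rather than chasing pointers to random heap addresses.

```c
#include <stdint.h>
#include <stddef.h>

/* Search a dense, array-backed binary tree laid out so that the
 * children of node i sit at indices 2i+1 and 2i+2. Each step down
 * roughly doubles the index: a forward walk with a doubling stride. */
static size_t dense_search(const uint32_t *tree, size_t n, uint32_t key) {
    size_t i = 0;
    while (i < n) {
        if (tree[i] == key)
            return i;                     /* hit */
        i = 2 * i + 1 + (key > tree[i]);  /* left child or right child */
    }
    return (size_t)-1;                    /* miss */
}
```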

The more complicated part, which I am still working on, is how to minimise both the horizontal and vertical imbalance of the tree. It's simpler than you might think, and most of the time it also keeps a lot of operations within a small section of memory.

Just to explain why this is important: conventional, sparse binary tree implementations rapidly descend into random-access hell. Not only that, each node necessarily requires at least 96 bits of extra data, on top of the data contained in the node, just to link it into the structure. In searching for a solution to a Cuckoo Cycle style loop, we necessarily need several bucket trees to work with: one for edges that haven't found their way into a candidate, one for edges that have been rejected, and a two-part store holding an unsorted linked list of all the candidates found so far, plus a tree that lets us link candidate coordinates back to the cycles they form part of.
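A back-of-envelope sketch of that overhead, assuming the 96 bits are three 32-bit links per node (left, right, parent):

```c
#include <stdint.h>
#include <stdio.h>

/* A conventional sparse tree node carries its links alongside the
 * payload: three 32-bit indices are the 96 bits of pure structure
 * mentioned above (with full 64-bit pointers it doubles to 192 bits). */
struct sparse_node {
    uint32_t value;                /* the payload itself: 32 bits  */
    uint32_t left, right, parent;  /* structural overhead: 96 bits */
};

int main(void) {
    /* In a dense store the array position encodes the structure,
     * so this per-node overhead disappears entirely. */
    printf("sparse node: %zu bytes, payload: %zu bytes\n",
           sizeof(struct sparse_node), sizeof(uint32_t));
    return 0;
}
```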

The memory overhead rapidly ascends to the point of ridiculousness if you use a binary search tree. For this reason Cuckoo uses a bucket sort instead, but that introduces an issue I think is very relevant to Proof of Work algorithms. If your process spends an inordinate amount of time between searches shuffling the nodes around, and then searches the nodes one by one, it should be obvious that we really want some of the advantages of a binary tree: the ability to skip very rapidly through parts of the list of elements and quite probably find the answer in less time, even counting the overhead of keeping the tree balanced during inserts and deletes.
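For contrast, here is a generic sketch of that kind of two-phase bucket pass, shuffle first, then scan each bucket element by element. This is not Tromp's actual code; the bucket count and capacity are arbitrary, and overflow handling is omitted:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS   256   /* illustrative: bucket by the top 8 bits   */
#define BUCKET_CAP 1024  /* illustrative capacity, no overflow logic */

typedef struct {
    uint32_t items[BUCKET_CAP];
    uint32_t count;
} bucket_t;

/* Phase 1: shuffle every node into a bucket chosen by its top bits.
 * This is the time spent "between searches" moving data around. */
static void bucket_fill(bucket_t buckets[NBUCKETS],
                        const uint32_t *nodes, size_t n) {
    memset(buckets, 0, NBUCKETS * sizeof(bucket_t));
    for (size_t i = 0; i < n; i++) {
        bucket_t *b = &buckets[nodes[i] >> 24];
        if (b->count < BUCKET_CAP)
            b->items[b->count++] = nodes[i];
    }
}

/* Phase 2: look a key up by scanning its bucket one element at a
 * time, rather than skipping through the data as a tree search can. */
static int bucket_find(const bucket_t buckets[NBUCKETS], uint32_t key) {
    const bucket_t *b = &buckets[key >> 24];
    for (uint32_t i = 0; i < b->count; i++)
        if (b->items[i] == key)
            return 1;
    return 0;
}
```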

Anyway, BAST is just the search-and-sort I am developing for my variant of Cuckoo Cycle, which I am calling Hummingbird. It is intentionally aimed at CPU solvers, but I think it is relevant to any graph-theoretic Proof of Work algorithm, and it will still have a performance and realtime edge over the approaches Tromp and his GPU solver wizards have come up with. It may not be too onerous to give systems with multiple GPUs a way to participate, but I suspect any real benefit would be limited by the bus width of the connection, so single-lane PCI Express would probably not be enough. By my calculations 4 lanes is the minimum to match CPU/DDR4 memory performance, and it would also require a memory access protocol for loading data directly into the CPU cache over PCI Express. I doubt that is complicated, though at the same time I am dubious it would beat a Ryzen 7 with 3200 MHz DDR4 memory.

The main point I am making is that designers of PoW algorithms have not, up to this point, considered scaling their resource requirements along with the cache sizes, bus speeds, and memory capacities that continue to grow in commoditised general-purpose computers. It would be good if previous-generation, current-generation, and next-generation video cards could all compete at once, but on a playing field that is somehow kept level. I mean, most people mining right now are pretty much at ROI on their gear, so it's pretty distressing to hear that someone has found a way to stomp all over us and take our profits away.

That’s really what this is all about. We all invested early, and we are getting trashed by these johnny-come-latelies because of the lack of vision and imagination in PoW designers.