c. Rewinding time is trivial with Planaria

There is a feature called “rewind” in Planaria Computer, a CLI tool for running Planaria nodes. It lets you rewind time back to a certain block height and recreate state from that point on.

pc rewind 520000

This lets you scrap everything that’s happened on Planaria since block 520000 and restart crawling from that block, as if nothing had happened. Simple as that.

Why the More Crashes, the Better

The crashes and failures of today are not only OK, but also desirable. I WANT more crashes and failures today to prepare for tomorrow.

In fact, I’ve explicitly designed Planaria so it FAILS and CRASHES in all unexpected edge cases.

This is because Planaria is not being built as some centralized API service company.

Planaria is being designed as an ENGINE that should be able to power any kind of decentralized swarm of fault tolerant APIs. And you can’t build a “fault tolerant system” without encountering as many faults as possible early on.

This is especially important because Planaria is not just a single application, but a generic framework that’s supposed to power ANY kind of app. It would be trivial to fix performance issues for Planaria if it were just a single type of API provider. But it’s not and it shouldn’t be treated as such.

Just a quick list of examples powered by Planaria:

Bitcoin explorer

Photo sharing application

Browser

Virtual Computer

File Server

and more.

And as you can see, they are all completely different types of applications with various distinct usage patterns.

This is why Planaria fixes and improvements should be carried out with a holistic mindset instead of short-sighted quick patches for whatever is immediately at hand. It’s all about making the framework itself resilient (not just a single API type). I want to encounter as many faults and as many bottlenecks as possible so I can generalize the problem and come up with the most universal solution that solves it efficiently.

Just to be clear, it’s very easy to fix many of the problems Planaria nodes face today if all I cared about was a single service’s uptime. For example, I could just increase the machine’s RAM and it would never crash. But that’s missing the point. Increasing the memory may fix the immediate problem for that service endpoint, but it doesn’t fundamentally solve the problem for future cases where Planaria needs to run in more memory-constrained environments.

So I intentionally keep the memory as low as possible for the public nodes I run, knowing very well that it will sometimes result in crashes when some unexpected transaction patterns occur on the network. It’s a tradeoff, but the benefit is much higher than the cost since recovery is easy.

Most public Planaria nodes at https://planaria.network operate in relatively constrained environments, somewhere around 1GB~8GB of RAM (with some exceptional cases that use 16GB).

Of course, all this assumes that these problems will eventually be fixed based on real-world data. This approach has served Planaria very well so far: it’s become much more efficient and performant than a year ago, when the project started out as BitDB, a precursor to Planaria.

Bugs Fixed + Lessons Learned + New Ideas

In this section, I will discuss a bug I discovered and fixed through the stress test, other lessons learned, and some cool ideas I came up with while thinking through all this.

1. [Fix] Memory Leak

Through the crash logs I found that there was a memory leak happening in TXO, the low-level transaction parser that powers Planaria.

What’s interesting is that this memory leak had normally been undetectable: it never occurred under normal conditions, and garbage collection would kick in long before the problem ever became serious. But the specific transaction pattern from this stress test surfaced the problem and crashed the nodes.

I have finally found the issue (some failed job queue actions were not exception-handled and therefore kept accumulating in memory) and have fixed it. I can’t guarantee that this was the sole reason for the crashes, but I am sure it was at least one of the problems.
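To illustrate the general pattern, here is a minimal sketch of how this kind of leak happens with a promise-based job queue. This is not the actual TXO code; the names (processJob, pending) are hypothetical.

// A minimal sketch of the leak pattern, assuming a promise-based job queue.
const pending = new Set()

async function processJob(job) {
  // stand-in for the real transaction-parsing work
  if (job.malformed) throw new Error('unexpected transaction pattern')
}

// BUGGY: a rejected promise never reaches the cleanup step,
// so every failed job stays referenced in memory forever.
function enqueue(job) {
  pending.add(job)
  processJob(job).then(() => pending.delete(job))
}

// FIXED: release the reference on failure as well as success.
function enqueueFixed(job) {
  pending.add(job)
  processJob(job)
    .then(() => pending.delete(job))
    .catch((err) => {
      console.error('job failed:', err.message)
      pending.delete(job)
    })
}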

So if you are running a Planaria node, please make sure to update to the latest version by running:

pc update

2. [Idea] Decouple Bitsocket from Bitquery

Reading through the logs, I realized that Bitsocket’s ZMQ handling was taking up more memory than desirable.

I think one way to optimize this would be to decouple the query endpoint from the socket endpoint. This is still an abstract idea at this point, but it’s a good start, because at least now I know performance will improve significantly if a good solution is found.
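Roughly, the separation could look something like this. This is a speculative sketch with hypothetical file names, ports, and a zeromq v5-style API; Bitsocket’s actual internals are different.

// socket.js -- owns the ZMQ subscription and nothing else
const zmq = require('zeromq')
const sock = zmq.socket('sub')
sock.connect('tcp://127.0.0.1:28332') // bitcoind's -zmqpubrawtx port (assumed config)
sock.subscribe('rawtx')
sock.on('message', (topic, rawtx) => {
  // fan out to socket subscribers here; no query state lives in this process
})

// query.js -- owns the read path and nothing else
const express = require('express')
const app = express()
app.get('/q/:query', (req, res) => {
  // run the query against the already-crawled database; no ZMQ state lives here
  res.json({ status: 'ok' })
})
app.listen(3000)

That way, a memory spike on the ZMQ side can’t take down the query API, and each process can be restarted and sized independently.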

Again, I couldn’t have come up with this idea if I hadn’t taken a closer look at what was going on, which led me to pay more attention to Bitsocket. So I’m grateful that the stress test caused the crash.

3. [Idea] Re-org Multiverse API

It’s funny how many people whine about re-orgs and think they’re some sort of “end of the world” event. This is how Bitcoin works by design!

Scaling requires pushing the boundary. You can stay behind and whine like a loser, or you can push the comfort zone to make progress. And it’s so great to see that Bitcoin SV is full of people who are pushing the comfort zone so hard. This is the ONLY way ahead, no shortcuts.

Also, I expect re-orgs to happen more often in the future as BSV keeps pushing the limits, which makes this a problem worth solving. So I sat down to think about it. At first I was trying to come up with a re-org rewind API. Here’s basically how it would work:

1. Planaria keeps checking back N blocks into the past to detect any re-org occurrence.
2. When a re-org is detected, it uses the rewind feature to rewind back N blocks.
3. Problem solved!
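As a rough illustration, the detection loop could look something like this. This is a speculative sketch: getBlockHash and the bookkeeping are hypothetical, and pc rewind is the only real piece.

// A minimal sketch of re-org detection by comparing remembered block hashes.
const { execSync } = require('child_process')

const recent = new Map() // height -> block hash we crawled, for the last N blocks

async function detectReorg(getBlockHash) {
  // walk the remembered blocks from newest to oldest,
  // tracking the lowest height whose hash no longer matches the chain
  const heights = [...recent.keys()].sort((a, b) => b - a)
  let forkPoint = null
  for (const height of heights) {
    const current = await getBlockHash(height)
    if (current !== recent.get(height)) forkPoint = height
  }
  if (forkPoint !== null) {
    // rewind to just before the fork and let Planaria re-crawl the valid chain
    execSync(`pc rewind ${forkPoint - 1}`)
  }
  return forkPoint
}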

But then I realized that re-orgs are one of the most interesting and fundamental features of Bitcoin. Instead of fearing a re-org, you could embrace it, even capture the re-org events themselves and turn them into valuable information.

Basically I am thinking about adding a re-org API that can:

can not only rewind back to the last valid checkpoint,

but can also generate the various multiverse states created by these re-orgs.

The API should be very flexible so developers can build all kinds of analytic tools or apps.
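Purely as a thought experiment, the developer-facing side could look something like this. This is entirely hypothetical; no such API exists in Planaria today.

const EventEmitter = require('events')

// stand-in for a future Planaria re-org event source (hypothetical)
const planaria = new EventEmitter()

planaria.on('reorg', (event) => {
  // event.forkHeight: height where the two chains diverged
  // event.orphaned:   blocks from the abandoned branch
  // event.adopted:    blocks from the newly valid branch
  //
  // Instead of throwing the orphaned branch away, persist it as its own
  // "universe" so apps can query and compare both timelines.
  saveUniverse(event.forkHeight, event.orphaned)
})

function saveUniverse(forkHeight, blocks) {
  // persist the orphaned branch as a separate queryable state (hypothetical)
  console.log(`captured universe forked at ${forkHeight} with ${blocks.length} blocks`)
}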

Imagine making use of re-org as a feature in apps. Exciting.

Conclusion

Did you know that Twitter used to go down all the time during its early days? They displayed this image of a whale whenever something went wrong with the server: