BlockCypher’s Ethereum API was knocked out for almost a month due to the Constantinople Hard Fork. This post-mortem explains what happened, lessons learned, and what we are doing to prevent this type of outage in the future.

Planning ahead for the Constantinople Hard Fork.

Ethereum announced in mid-December 2018 that its Constantinople Hard Fork (HF) would occur on January 15, 2019. The Ethereum developers stated that they expected the “upcoming fork to be the least eventful one in the history of Ethereum.” We disagreed: the HF impacted hundreds of source files. We proactively began following the upgrade instructions outlined by Ethereum, which also involved modifying our back-up stores. We worked through the Christmas holidays and completed the work in the first week of January 2019.

We thought we were ready.

Photo from Bubba73

January 8: something is very wrong.

The night of January 8, we realized something was wrong with our Ethereum state but we did not know what: the only thing we knew was that we were getting an error saying some small piece of data was missing. The Ethereum state is inscrutable (all data is hashed in a tree), which made it impossible for us to figure out what exactly was wrong.

We attempted multiple recovery procedures with no success. We kept getting the same error: a missing Trie node.
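To illustrate why a single missing Trie node is so damaging, here is a minimal sketch of a hash-addressed tree. It is a toy, not Ethereum's actual structure: it assumes a plain binary tree and SHA-256, whereas Ethereum uses a Merkle Patricia trie with Keccak-256, and all function names are hypothetical. The point it shows is that every node is stored under its own hash, so when one node is absent the store cannot say what it contained, only that the hash is unresolvable.

```python
import hashlib
import json


class MissingNodeError(KeyError):
    """Raised when a node referenced by hash is absent from the store."""


def put(store, node):
    """Store a node under the hash of its serialized form; return that hash."""
    raw = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(raw).hexdigest()
    store[key] = raw
    return key


def build(store, items):
    """Build a toy binary hash tree over sorted (key, value) pairs; return the root hash."""
    level = [put(store, {"leaf": [k, v]}) for k, v in sorted(items.items())]
    while len(level) > 1:
        # Each parent node holds only the hashes of its children.
        level = [put(store, {"children": level[i:i + 2]})
                 for i in range(0, len(level), 2)]
    return level[0]


def walk(store, root):
    """Yield every leaf reachable from root, failing on the first absent node."""
    raw = store.get(root)
    if raw is None:
        raise MissingNodeError(root)
    node = json.loads(raw)
    if "leaf" in node:
        yield tuple(node["leaf"])
    else:
        for child in node["children"]:
            yield from walk(store, child)


store = {}
root = build(store, {"alice": 10, "bob": 20, "carol": 30, "dave": 40})
assert len(list(walk(store, root))) == 4

# Drop one interior node: everything beneath it becomes unreachable,
# and the only clue left is an opaque hash.
lost = json.loads(store[root])["children"][0]
del store[lost]
```

After the deletion, `list(walk(store, root))` raises `MissingNodeError` carrying nothing but the hash of the lost node, which is roughly the situation described above: the error says that something is missing, not what it was.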

Having failed multiple times to discover and recover the missing data, we began the ‘Fast’ sync process, which took over 2 days to complete. Unfortunately, it did not help us restore the missing data, nor did it restore our state.

For everyone who asked these questions:

Why did a fast sync not work? Because it only includes a small subset of the whole blockchain data. To provide and operate our APIs reliably we need all of it.

Why didn’t we make a back-up copy of our state before doing the Constantinople update? We did, but it was partially corrupted by the restore. Also, the Ethereum state is not a database that can simply be backed up and patched. A backup cannot be taken while the Ethereum node is online, it cannot be taken incrementally, and the state is well over a terabyte.

(Lesson Learned №1: the Ethereum state is very different from that of other blockchains. It cannot be restored using any traditional backup method.)

The long full-synchronization march began.

As a last resort, we began a ‘Full’ sync of the 2+ terabyte Ethereum state on January 12. Knowing the size we had to contend with, we upgraded to the biggest available machines in an attempt to speed up the sync. It barely made a difference. Compounding our problems, the process offers no transparency: we had no idea how far along the sync was and no information with which to update our customers.
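For readers wondering what little signal is available during a sync, the sketch below summarizes the result of the standard `eth_syncing` JSON-RPC method, which returns `false` when the node is not syncing and otherwise an object with hex-encoded `startingBlock`, `currentBlock`, and `highestBlock`. This is an assumption-laden illustration rather than what we ran: the helper name and the sample values are hypothetical, and a block-height ratio says nothing about how much Trie state is still outstanding, so it is a coarse indicator at best.

```python
def sync_progress(result):
    """Summarize the `result` field of an eth_syncing JSON-RPC response.

    Returns None when the node reports it is not syncing (result is False).
    """
    if result is False:
        return None
    start = int(result["startingBlock"], 16)
    current = int(result["currentBlock"], 16)
    highest = int(result["highestBlock"], 16)
    span = max(highest - start, 1)  # avoid dividing by zero on a fresh node
    return {
        "current": current,
        "highest": highest,
        "remaining": highest - current,
        "percent": round(100.0 * (current - start) / span, 1),
    }


# Hypothetical sample response, not real data from the outage.
sample = {
    "startingBlock": "0x0",
    "currentBlock": "0x3d0900",   # 4,000,000
    "highestBlock": "0x6acfc0",   # 7,000,000
}
report = sync_progress(sample)
```

Even with a number like this in hand, the time per block varies widely over the chain's history, so it would not have translated into a reliable completion estimate for customers.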

We were left helplessly waiting and checking.

On January 14, the day before the Constantinople HF was scheduled to take effect, it was cancelled. Apparently a security audit found a vulnerability that could allow a potential attacker to steal cryptocurrency from a smart contract. The last-minute cancellation was incredibly demoralizing to us. Had we waited to implement Constantinople until AFTER it took effect, we would have saved ourselves an incredible amount of work, angst, and expense, and our ETH API would have been working the entire time.

(Lesson Learned №2: don’t plan ahead for Ethereum upgrades. Wait until they happen.)

Over 2 weeks later, we learned the ‘Full’ sync is not really a full-state restore.

After 2+ weeks, our Ethereum state was restored, but it was not the end of our woes. As it turned out, the full sync defaults to NOT including the full Trie state. If you are doing a full sync, why would a default setting NOT include everything? That defies logic. Our next challenge was figuring out how to add the Trie state into our ‘full’ state.

Vitalik, please help!

After examining every way we could think of to add the Trie state to our Ethereum state, we asked Vitalik for assistance. His first comment to us was “oh you’re one of the few running one of those big, scary nodes.” We asked him if he knew of anyone else running a “big, scary node” to see if we could possibly sync with them. He knew of no one. Not even the Ethereum Foundation keeps a full archival copy of the Ethereum chain. We were back to square one: starting the Full sync again, this time including the Trie state.

(Lesson Learned №3: in the event of a chain re-organization, we may be the only ones to know the entire history of Ethereum transactions.)

What are we doing to prevent this from happening again?

Only our ETH API was knocked out. All our other blockchain APIs continued to work as intended. It became clear to us that the Ethereum state is very different from that of any other blockchain: there is no recourse to roll back blocks, no straightforward state restore, and no method to back up the state. As soon as we recovered, the question became: how can we prevent customer disruption going forward? The short answer is: from now on we will wait until after an Ethereum upgrade is in effect before upgrading our backups, and we will plan any outage ahead of time. The longer answer is we’ve put multiple redundancies in place to continuously duplicate staggered states so that we don’t ever (knock on wood) have to do a full sync again.
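The idea behind staggered states can be sketched in a few lines. This is an illustration under stated assumptions, not a description of our production setup: the replica names, the 24-hour cycle, and both helper functions are hypothetical. Because a copy cannot be taken while a node is online, the sketch spreads the snapshot times evenly so that only one replica is offline at a time, and then picks the newest snapshot that predates a known corruption.

```python
from datetime import datetime, timedelta


def staggered_offsets(replicas, cycle_hours):
    """Spread snapshot start times evenly across one cycle, one replica at a time."""
    step = timedelta(hours=cycle_hours) / len(replicas)
    return {name: step * i for i, name in enumerate(replicas)}


def newest_clean_snapshot(snapshots, corrupted_at):
    """Return the latest (replica, taken_at) snapshot taken before corruption began."""
    clean = [s for s in snapshots if s[1] < corrupted_at]
    return max(clean, key=lambda s: s[1]) if clean else None


replicas = ["eth-a", "eth-b", "eth-c"]  # hypothetical node names
offsets = staggered_offsets(replicas, cycle_hours=24)

day = datetime(2019, 3, 1)
snapshots = [(name, day + off) for name, off in offsets.items()]

# Suppose corruption is traced back to noon: restore from the 08:00 copy.
restore_from = newest_clean_snapshot(snapshots, corrupted_at=day + timedelta(hours=12))
```

With three replicas on a 24-hour cycle, the worst case is falling back to a copy roughly eight hours old and replaying blocks from there, rather than starting a multi-week full sync.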

As we have seen many times, blockchains go through growing pains. Some are more painful than others and we learned a very painful lesson with Constantinople. When the HF actually happened on February 28, it was almost easy!