Of course, Bryukhanov is engaging in some self-preservation behavior here; by the end of this all, he’ll be on trial with the other leaders of the plant. The fear of being blamed and the need to blame others are intertwined.

In these early moments after the explosion, several people ask the question that becomes something of arc words for the series: “How does an RBMK reactor explode?” Here it’s asked to discredit anyone claiming that there’s been an explosion at all. It’s a weapon of the blameful postmortem culture, shutting down exploration of the problem because no one wants to be seen as stupid, even if that means ignoring the plain truth: that there’s a giant hole where reactor 4 once was.

When it’s asked again later in the series, it’s a legitimate question, driven by the fact that the scientists investigating the incident don’t actually know how this kind of failure can happen. This is a failure that has never happened before or even been imagined before. Or…was it?

Debugging After The Fact

It’s December of 1983. Operators at an RBMK nuclear power reactor scheduled to come online by New Year’s Day 1984 are conducting one of the final remaining safety tests to ensure that the plant is ready to be fully operational.

The Ignalina Nuclear Power Plant in Visaginas, Lithuania was essentially a sibling plant to Chernobyl. Both plants were using the new RBMK reactor design. Both plants were intended to be among the world’s largest and most powerful nuclear power plants ever constructed. Both plants were part of a plan for a new atomic golden era in the USSR. And both plants were scheduled to come online by 1984.

The RBMK reactor tube tops at Ignalina. Photo by Argonne National Laboratory.

At Ignalina Reactor 1, operators were preparing to SCRAM the reactor as part of the test. They’d removed a number of control rods and pushed the AZ-5 button to reinsert the rods and fully shut down the reactor. But something peculiar happened. The reactivity went up. The power output spiked.

In this test Ignalina Reactor 1 didn’t have a dangerous number of removed control rods, and hadn’t been running at low power for a day before the experiment. The power output went back down as the control rods fully inserted, and the plant shut down as intended. No disaster occurred. Still, the operators knew that the spike in reactivity wasn’t just an equipment fluke. They reported it in their test results. As translated in the International Nuclear Safety Advisory Group’s 1992 report on the Chernobyl disaster (pdf), their report read in part:

When the reactor power decreases to 50% (for example, when one of the turbines is switched off), the reactivity margin is reduced as a result of poisoning…Triggering of the EPS [emergency protection system] in this case may lead to the introduction of positive reactivity. It seems likely that a more thorough analysis will reveal other dangerous situations.

They made a number of safety recommendations based on their observations, including conducting further investigation and that until that investigation was done, “the number of rods which may be fully withdrawn from the core (up to the upper limit stop switch) should be limited to 150 for the RBMK-1000 reactor.” Chernobyl Reactor 4 had nearly 200 fully withdrawn control rods on April 26, 1986.

It’s not quite that Chernobyl operators weren’t reading the postmortems put out by colleagues in other plants. The successful SCRAM caused the results in the report to be buried. The test was fine. The spikes were a fluke, probably a mis-measurement, and at any rate, reporting them widely would damage the reputation of the state-of-the-art RBMK reactor.

Now the good news for software engineers is that there’s generally not a state government censoring our access to Hacker News. We are thus allowed to read all the postmortems about how Panacea.js is actually severely flawed in spite of all the posts two years ago about how it was going to solve all our problems.

However, the human factors that lead us to disregard little incidents crop up all the time. When’s the last time you started trying to trace a problem causing a script to crash at seemingly arbitrary times after running for a couple days, couldn’t quite figure it out or force it to reproduce, needed the system to work, and just wrapped the service in a systemd unit and told it to restart automatically if it crashed (this is definitely not a thing I did last month, why do you ask)?

This will not end poorly.
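For reference, that restart-on-crash band-aid is about five lines of systemd configuration. This is a minimal sketch; the unit name and script path are invented for illustration:

```ini
# /etc/systemd/system/flaky-script.service  (hypothetical name and path)
[Unit]
Description=Script that crashes at seemingly arbitrary times
After=network.target

[Service]
ExecStart=/usr/local/bin/flaky-script
# Restart the process whenever it exits uncleanly, papering over the real bug
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Note what this doesn’t do: capture any state about why the process died. The crash still happens; we’ve just stopped having to notice it.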

Getting things working now and ignoring systems that aren’t actively broken is a classic move in the software engineering playbook. When things (probably inevitably) explode because nobody understood the real bug, you’re likely to hear an engineer exclaim “oh yeaahh…” at some point during the incident.

Systems are complicated and “making them simpler,” while a noble goal, is easier said than done (like basically everything except floccinaucinihilipilification, the act of estimating something as worthless, which is easier done than said). Understanding the inherent complexity in our software systems also usually involves a disproportionate amount of understanding systems we didn’t build ourselves (Linux, of course, is entirely bug free, as are AWS hypervisors and Docker containers and Nginx servers and Javascript runtimes).

The understanding we have of a system affects how we operate it, so teams need time to debug and explore issues and an environment where “get it working right now” and “be able to debug the issue” are kept out of conflict as much as possible. Bryan Cantrill’s Debugging Under Fire talk goes into great detail on this, just watch it.

A much funnier, smarter, and more accurate talk on debugging than this article that just happens to also mention a different northeast blackout AND a different nuclear power plant failure.

Once we understand our system well and put a bunch of monitoring into it, we’ll have the details we need to observe the system live as the incident unfolds and understand it. If we do that, there’s nothing we can do wrong when it comes to debugging it, right?

The Data We Have and the Data We Need

The inside of a reactor core, especially a very large reactor core like the RBMK-1000, is hard to observe. Reactions happen on the order of tens or hundreds of nanoseconds, neutrons are moving somewhere between 2500 m/s and 20,000,000 m/s, and everything is a tiny subatomic particle.

Control rooms have a lot of high-level information about what’s going on with the reactor: how much power the generator is producing, for instance, which pumps are operating, and which control rods are inserted and how far. All this information lets operators make decisions like “we have a stuck valve, we should fall back to the backup pumps” or “power is rising too quickly, we should insert a few more control rods.”