In the cockpit, the pilots noticed the effects of the bad data within seconds of its creation. First, the autopilot disconnected as it proved unable to reconcile the differences in the data it was receiving from the three ADIRUs. Captain Sullivan immediately announced that he had manual control. Less than five seconds later, the pilots found themselves bombarded by a sudden cascade of warnings triggered by the mislabeled and corrupted data. Fault messages flooded onto the computer screen in the central console, and the “stall” and “overspeed” warnings both started going off intermittently — an obviously impossible combination, considering that one indicated they were flying too slow and the other indicated they were flying too fast!

Captain Sullivan tried engaging the A330’s second, backup autopilot. At the same time, the airspeed and altitude values on Sullivan’s flight display, which sources its data from ADIRU 1, appeared to go haywire, fluctuating wildly in a manner completely inconsistent with the aircraft’s level and docile trajectory. A fault message and warning light associated with the number one inertial reference unit (part of ADIRU 1) also went off. In response to the unreliable airspeed indications, Sullivan switched the autopilot back off and flew the plane manually using the standby instruments on the center console. Utterly baffled by the cascade of apparently false warnings, Captain Sullivan and Second Officer Hales called First Officer Lipsett back to the cockpit to help figure out what was going on.

But before Lipsett made it to the cockpit, the sequence of events unfolding in the realm of information suddenly broke through into the real world. A spike of altitude data mislabeled as AOA data and marked as valid by the flight computer triggered two separate emergency conditions of the A330’s so-called flight envelope protections. Flight envelope protections, a central part of Airbus’ design philosophy, are limits imposed on the pitch, angle of attack, airspeed, and bank angle that will trigger automatic corrective actions when exceeded. These protections normally prevent pilots from making control inputs that could put the plane into a dangerous attitude, and correct a dangerous attitude if one develops. But the faulty data incorrectly triggered two of these protections even though the aircraft was in a normal attitude for cruise flight. A system called “high AOA protection” detected an excessively high angle of attack (sourced from the faulty ADIRU 1) and applied a 4-degree nose down elevator input, the maximum it could command, to help bring the AOA back within limits. At exactly the same time, the same bad data triggered a separate system called “anti-pitch up compensation” that is intended to counteract the A330’s tendency to pitch up when flying at a high speed and high angle of attack. This system applied a 6-degree nose down elevator input, which also happened to be the maximum it could command. The two nose-down commands were additive, together applying a sudden 10-degree nose down elevator movement.

The two control mechanisms triggered by the bad AOA data. Image source: the ATSB
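
The way the two commands stacked can be shown with a few lines of arithmetic. This is only an illustrative sketch using the limits quoted above; the function and constant names are hypothetical and do not come from any Airbus software.

```python
# Maximum nose-down elevator authority of each system, as quoted in the article.
# The names below are hypothetical and chosen only for this illustration.
HIGH_AOA_PROTECTION_MAX_DEG = 4.0
ANTI_PITCH_UP_MAX_DEG = 6.0


def nose_down_command(high_aoa_active: bool, anti_pitch_up_active: bool) -> float:
    """Sum the nose-down elevator inputs of whichever systems are triggered."""
    total = 0.0
    if high_aoa_active:
        total += HIGH_AOA_PROTECTION_MAX_DEG
    if anti_pitch_up_active:
        total += ANTI_PITCH_UP_MAX_DEG
    return total


both_triggered = nose_down_command(True, True)
print(both_triggered)  # prints 10.0
```

With both systems triggered at once, the sketch returns the same 10 degrees of nose-down elevator described above.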

The effect of a 10-degree nose down command while in cruise flight was sudden and catastrophic. The plane entered an immediate dive, flinging into the ceiling anyone and anything that wasn’t tied down. At least 60 seated passengers weren’t wearing their seat belts, and the negative G-forces slammed them head-first into the passenger service units on the bottom of the overhead bins. Several others, including most of the crew and some 20 passengers, were out of their seats carrying out various duties or making their way to the toilets. They too found themselves thrown against the ceiling with great force. Luggage compartments burst open, spilling suitcases and backpacks into the aisles. Drinks, food, laptops, books, and other loose items flew in every direction.

Reenactment and simulation of the dive. Video source: Mayday

In the cockpit, the pilots were pulled up and out of their seats, restrained only by their lap belts. Captain Sullivan reached for his side stick to pull the aircraft out of the dive, but when he tried to bring the nose up, there was no response; the automatic systems had locked him out. He let go and then tried again. This time, because the data spike was over, the elevators responded and the plane started to level out.

As the negative G-forces subsided, everyone in the cabin who was pinned to the ceiling came crashing back down again. People slammed into the floor, the seats, and other passengers, falling back down amid a chaotic flurry of random objects. Still recovering from the shock of the upset, passengers and crew alike took stock of the situation. The violent maneuver had caused widespread injuries — there were broken bones, concussions, serious lacerations, and more. All of the flight attendants were injured to various degrees. One person broke a leg, several suffered serious spinal injuries, and many were bleeding profusely. First Officer Lipsett, who had been on his way to the cockpit, broke his nose.

Now back in control, Sullivan and Hales, who were not hurt, set about trying to clear all the error messages on the computer screen. The fault notifications affected a wide variety of systems, and many of them required no action, but the one that kept coming up again no matter what they did was the same “NAV IR 1” fault that they had received earlier. And as they worked, stall and overspeed warnings continued to blare. Second Officer Hales made an announcement over the public address system calling for all passengers and crew to sit down and fasten their seat belts immediately.

Suddenly, another spike of bad AOA data made it through to the flight computer. Although the disconnection of the autopilot had changed the flight envelope protection logic, removing the high AOA protection, the anti-pitch up compensation system remained active and was triggered again. This time the dive wasn’t as steep and most people had fastened their seat belts, but some who had been injured or were trying to help others had not, and they were thrown into the ceiling again. Just like the first time, Sullivan’s initial efforts to pull up had no effect; and just like the first time, the resistance abated after several seconds and he was able to level the plane.

A sudden pitch down was one thing, but two sudden pitch downs were quite another. With all kinds of alarms going on and off in the background and new error messages appearing constantly, the crew were unsure what was happening and feared they could dive again at any moment. An immediate landing at Learmonth in Western Australia seemed like the best option.

Lipsett, despite his broken nose, at last made it to the cockpit and took over for Hales. He reported that there were injuries among the passengers as well. At this time, Sullivan noted that the automated stabilizer trim wasn’t working; the trim would have to be adjusted manually. The navigation equipment was also not functioning and they couldn’t interact with the computer interface at all. Sullivan declared a pan-pan-pan, one level short of a mayday, and informed controllers that flight 72 was headed to Learmonth with “flight computer problems.” After receiving word from the flight attendants that there were numerous broken bones, lacerations, and other injuries, he upgraded this to a full mayday and requested that ambulances meet the aircraft after landing.

The pilots flew the remainder of the flight in full manual mode, trying to ignore the constant spurious alarms that refused to turn off. First Officer Lipsett called Qantas maintenance in Sydney over the satellite communication system to try and get help to resolve the situation, but they were also unable to figure out what was wrong. However, the sudden pitch downs never returned, and flight 72 landed safely at Learmonth at 1:32 p.m.

Emergency services work on the plane after the emergency landing. Image source: news.com.au

All told, at least 119 of the 315 passengers and crew were injured, 12 of them seriously. The interior of the cabin was utterly trashed. Ceiling panels were broken, passenger service units destroyed, overhead bins wrenched out of alignment. Trash, food, blood, and spilled drinks littered the floor. And while the plane would fly again and no one was killed, many people suffered injuries that will be with them for the rest of their lives, all because of some “ghosts in the code.” Investigators with the Australian Transport Safety Bureau had to ask: how could such a thing happen?

As it turned out, it wasn’t the first time this type of error had occurred. Another Qantas A330 had experienced a similar data problem in 2006, also off the coast of Western Australia. And in December of 2008, it happened again on yet another Qantas flight off Western Australia. Neither of these other two cases involved an uncommanded pitch down, but the failure mode of the ADIRU in all three incidents was similar, and two of them even involved the exact same ADIRU. The fact that these failures all occurred within a small geographical region seemed too strange to be a coincidence, but despite a variety of theories, and a call from the Australian and International Pilots Association to ban flights over the area, investigators could find nothing inherent to Western Australia that could have caused the malfunctions.

Damage to overhead bins and passenger service units. Image source: the ATSB

In fact, the ATSB was never able to conclusively find what caused the ADIRU to start sending out false and mislabeled data. Only one theory could not be ruled out: a Single Event Effect, or SEE for short. A SEE occurs when a high-energy particle from outer space, such as a neutron, strikes a computer chip and randomly changes a binary switch from one to zero or zero to one. If a SEE occurred at a critical location within the ADIRU CPU’s memory module, it could, just maybe, have triggered everything that followed. The ATSB was unable to find evidence to prove or disprove the theory, but the fact that the two ADIRUs that experienced this type of malfunction were close to one another in serial number suggested that there might have been some minute hardware flaw in that batch of ADIRUs that made them more susceptible to a SEE.
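
To get a feel for why a single flipped bit matters, consider the following minimal sketch. It does not model the ADIRU in any way; the stored value and the bit position are assumptions picked purely for illustration.

```python
def flip_bit(word: int, bit_index: int) -> int:
    """Return word with the bit at bit_index inverted (XOR with a one-bit mask)."""
    return word ^ (1 << bit_index)


original = 37                       # hypothetical stored value, 0b100101
corrupted = flip_bit(original, 14)  # a single high-order bit flips from 0 to 1

print(original, "->", corrupted)    # prints "37 -> 16421"
```

One inverted bit is enough to turn a small number into a wildly different one, and nothing in the value itself shows that it was ever changed.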

Damage to the ceiling in the aisles. Image source: the ATSB

What made the failure of the ADIRU dangerous was not that it failed per se, but that the invalid data passed through the many layers of cross-checks without being flagged as such. Had the data spikes been flagged as invalid at some point in the process, the computer would have disregarded them and the safety of the flight would never have been compromised. The investigation found a hitherto unknown failure mode in which data spikes occurring approximately every 1.2 seconds could trick the computer into thinking bad data was real. This was where the real safety problem lay. It might not be possible to prevent a few ones and zeroes from becoming corrupted every now and then, but if the layered protections couldn’t always detect the corrupted data, that represented a safety risk. Those protections were good — the ADIRU itself could weed out 93.5% of invalid data on its own before the computer even did its cross-checking — but this wasn’t enough to prevent a bit of mismatched code from injuring 119 people. In principle, however, the ADIRU remained completely safe. This type of failure occurred only three times in 128 million hours of service for this model of ADIRU, well within the probability zone that regulators consider “extremely remote.”
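
How a well-timed spike could slip past a filter can be illustrated with a toy model. The sketch below is not the actual Airbus algorithm. It assumes a simplified filter that, on detecting a spike, repeats the last good value for a fixed hold period and then accepts the next sample without re-checking it. The threshold, the sample spacing (six samples standing in for 1.2 seconds), and all names are hypothetical.

```python
def filter_aoa(raw, reference, hold_steps=6, threshold=5.0):
    """Toy hold-then-resume spike filter (an assumption, not the real logic).

    raw:       AOA samples in degrees from the suspect source
    reference: trusted AOA samples used only to detect spikes
    Returns the values the filter would pass on to the flight computer.
    """
    passed = []
    last_good = reference[0]
    hold_left = 0              # samples still to be replaced by last_good
    resume_unchecked = False   # True for the first sample after a hold
    for value, ref in zip(raw, reference):
        if hold_left > 0:
            # Inside the hold window: ignore the input entirely.
            hold_left -= 1
            resume_unchecked = hold_left == 0
            passed.append(last_good)
        elif resume_unchecked:
            # The illustrated flaw: the first sample after the hold is trusted.
            resume_unchecked = False
            last_good = value
            passed.append(value)
        elif abs(value - ref) > threshold:
            # Spike detected: this sample counts as the first held sample.
            hold_left = hold_steps - 1
            resume_unchecked = hold_left == 0
            passed.append(last_good)
        else:
            last_good = value
            passed.append(value)
    return passed


steady = [2.0] * 12            # level cruise, 2 degrees AOA throughout

one_spike = list(steady)
one_spike[3] = 50.0            # a lone spike is filtered out

two_spikes = list(steady)
two_spikes[3] = 50.0
two_spikes[9] = 50.0           # second spike lands exactly when the hold ends

print(filter_aoa(one_spike, steady))
print(filter_aoa(two_spikes, steady))
```

In this toy model the lone spike never reaches the output, but the second of two spikes spaced exactly one hold period apart is passed through as if it were valid.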

A man receives medical attention after the accident. Image source: the Sydney Morning Herald

One final angle that the ATSB pursued was the rate of seat belt usage among airline passengers. During the two in-flight upsets, unrestrained passengers crashed into the ceiling and into other passengers, causing injuries not just to themselves but also to others who were wearing their seat belts and otherwise wouldn’t have been hurt. While a few factors could be correlated with lower seat belt use, there was no universal reason why people chose not to wear them. Getting people to wear seat belts when the seat belt sign isn’t on is a challenge that airlines have grappled with for decades. Turning the seat belt sign on all the time isn’t a practical solution because people would grow complacent about its presence and ignore the sign at higher rates than before. Investigators decided that more research would have to be done to find the most effective ways to get around this paradox.

Damage to the aisle ceiling. Image source: NZHerald

In its final report, the ATSB wrote that the investigation was extremely difficult and touched on numerous areas where no air accident investigation had ventured before. The authors of the report were also keenly aware that the Qantas flight 72 incident could be representative of the sort of case that will become more and more common in the modern era. “Given the increasing complexity of [aircraft] systems,” they wrote, “this investigation has offered an insight into the types of issues that will become relevant for future investigations.”

Just days after the accident, Airbus issued a bulletin to all A330 operators instructing pilots to immediately shut off the indicated ADIRU when receiving a “NAV IR” fault. This advice might have prevented a similar accident in December of that year, when the pilots of Qantas flight 71 experienced an identical ADIRU malfunction but switched off the affected unit after just 28 seconds. Regulatory authorities worldwide re-issued this Airbus bulletin as an airworthiness directive, making it an official rule. Airbus also redesigned the logic used by the flight computer to verify AOA data, removing the possibility that well-timed data spikes could make it through the cross-check. And furthermore, Airbus began including novel ways of testing its data verification software, including testing with intermittent data spikes, which had not previously been attempted.

VH-QPA, the aircraft involved in the accident, photographed in 2018. Image source: Masakatsu Ukon

However, the ATSB ran into a problem: although the event that precipitated this failure was so rare that the ADIRU still met all reasonable safety guidelines, it represented only one example of corruption within the vast quantities of information being processed inside an airplane’s many computers. What other loopholes might exist that could cause a software bug, a SEE, or other sources of bad data to manifest in dangerous ways? How could these events ever be predicted?

One way was to tackle one of the suspected sources of errors: SEEs. After the Qantas accident, the European Aviation Safety Agency started asking manufacturers of aircraft computers to take into account SEEs during the design phase to make their products less susceptible. At the time of the report’s publication, the US Federal Aviation Administration was still researching the best ways to approach the problem. Today, understanding of the safety implications of this phenomenon is still developing. Nevertheless, Qantas flight 72 stands out as the first case where investigators delved deeply into a serious software failure — and serves as a reminder of the importance of keeping one’s seat belt fastened at all times.
