If I were setting up the curriculum at a university, I’d make an entire semester-long class on the Challenger disaster, and make it required for any remotely STEM-oriented major.

Because I think it’s too easy to think of it as just a random-chance disaster, or as just a space/materials engineering problem whose lessons only matter to that field. And that’s not really the most important lesson to learn from the Challenger disaster!

Because, yeah, you can look at it as just a random-chance disaster, like a natural event. You could look at it as a lesson on problems in rocket design, on the problems with the shuttle program, on the risks we have to take to explore. But the most important part, in my opinion, is the lesson on “Normalization of Deviance.” The Challenger disaster wasn’t a single mistake or flaw or random chance that resulted in the death of 7 people and the loss of a 2 billion dollar spaceship. It was a whole series of mistakes and flaws and coincidences over a long time, and at each step they figured they could get away with it because they figured the risks were minimal and they had plenty of engineering overhead. And they were right, most of the time…

Then one day they weren’t.

Normalization of deviance starts from the fact that things are designed and limits are calculated. We can go this fast, this hard, this hot, this cold, this heavy. But we always want to optimize. We want to do things cheaper, quicker, more at once.

And the thing is, most of the time going a little faster, a little hotter, that’s fine. Nothing goes wrong. Engineers always design in a safety margin, as we’ve learned the hard way that if you don’t, shit goes wrong very fast. So going 110% as fast as the spec says? Probably OK. But the problem is what if you’ve been doing that for a while? You’ve been going 110% all the time. It’s worked out just fine. You’re doing great, no problems. You start to think of 110% as the new normal, and you think of it as just 100%.

You probably don’t rewrite the specs to say the limit is 110%, but you always have the official rules and the “way things are done.” And everyone knows those don’t always exactly align…

Like your job’s security says to never reuse passwords and never write them down and they have to be 20 characters and 4 digits and upper and lower case and 3 Sanskrit characters. The computer tests all those except “never write it down.” Guess which one gets violated? And everyone does, because the alternative is not getting work done because they’re waiting on IT to reset their password. And this just becomes the unwritten How Things Are Done, despite the written How Things Are Done saying explicitly not to do this. And you do this in your office, and you think the stakes are low. And they probably are. But this kind of thing doesn’t just happen to some punks in an office doing spreadsheets. It happens to actual rocket scientists.

So when the spec says 100% and you’ve been doing 110% for the last 20 missions and it seems to be working just fine, and then one day you’re running into 5 other problems and need to push something, well, maybe you do 120% today? After all, it’s basically just 10% over normal. Because in your head you’re thinking of the 110% as the standard, the limit. You’ve normalized going outside the stated rules, and nothing went wrong. So why not go a little more? After all, 110% was just fine…
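
To put numbers on that, here’s a tiny Python sketch of the gap between how big the push feels and how big it actually is. The function and its parameter names are made up for illustration; the 100/110/120 figures are just the ones from the example above.

```python
def perceived_vs_actual(spec_limit, actual_load, felt_normal):
    """Compare how big a push feels with how big it really is.

    felt_normal is the level everyone has gotten used to running at;
    spec_limit is what the design documents actually say.
    """
    felt_overshoot = (actual_load - felt_normal) / felt_normal
    real_overshoot = (actual_load - spec_limit) / spec_limit
    return felt_overshoot, real_overshoot


felt, real = perceived_vs_actual(spec_limit=100, actual_load=120, felt_normal=110)
print(f"feels like {felt:.0%} over, is actually {real:.0%} over")
# prints: feels like 9% over, is actually 20% over
```

The push that feels like a small nudge is double that against the number the safety margin was actually calculated from.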

But the problem is that there’s no feedback loop on this. There’s often no obvious evidence that going outside the “rules” is wrong. Steve wrote down his password and it’s not like he got fired for doing that. So why not do it too?

And the feedback you do eventually, finally get might just be completely disastrous, often literally. You don’t get any “HEY STOP WRITING DOWN YOUR PASSWORDS” feedback until the whole company gets hacked and your division is laid off.

And the feedback you do get is misapplied. Like, I like to joke that my roommate’s cat is very smart. We want to keep her off the kitchen counter for sanitary reasons, so whenever we see her on the counter, we spray her with water. So she learned: never go up on the counter … when there’s someone there to see you. Susan gets in trouble cause she put a post-it note with her password on her monitor, and we had to sit through a boring security meeting about password security. So people learn. They put their passwords in their wallet and in their phone.

This is a silly example, but it’s also exactly what happened with Challenger. The O-rings on the solid rocket boosters had a problem where hot gases would leak past them during lift-off, but every time this happened, the O-ring would shift and reseal the leak. So it was a thing that was never designed to happen, but when it happened and seemed to be fine, they wrote it into the documentation. It was now just a thing that happened. Gas will escape past the O-rings, but it’s okay, they self-seal. And as long as everything was within original operating parameters, this’d be fine. But other things were pushed.

The Challenger launch was repeatedly scrubbed because of minor issues in other components, or cross-winds that were too high. And then NASA finally thought they had a day they could launch, but with one problem: it was too cold.

And it seems a silly thing to worry about it being too cold to launch a SPACE ROCKET but when you design things you have to decide what temperature range they need to operate in. You gotta pick materials and do tests to fit that range. If your rocket is only going to take off in temperatures from 40 degrees F to 90 degrees F, you pick certain materials and test in those temperatures. If you had to launch at colder or hotter, you might need different materials and more expensive tests. So you decide on limits.

But you’ve launched at 40F and it was fine, and then one day you had to launch at 35F and it was fine, and then on a particularly bad day you had to launch at 30F and you’re fine. So you normalize this deviance. You can launch down to 30F, if you really have to. But then one day you’ve missed a bunch of launch windows and it’s 28F and the overnight temperatures were 18F but you did a quick check of the designs and specs and you probably have enough safety margin to launch, so you say GO.

And you discover 73 seconds into the flight that the O-rings that seemed to always self-seal? They don’t self-seal if they’re too hard and brittle from the cold. The gases keep leaking. The hole widens. High pressure high-temperature gas comes out the booster rocket and starts to melt the attachment joints between the boosters and the external tank. It happens at the time when the rocket is undergoing the strongest stresses from take-off, and the tank fails. The solid rocket boosters separate from the now disintegrating orbiter stack and have to be destroyed by a range safety officer. The crew probably survived in the reinforced cabin until it struck the ocean.

And it’s important that the lesson we learn from this isn’t as narrowly focused as “the space shuttle was badly designed” (it wasn’t! It was a compromised design that had lots of amazing work poured into it) or even “don’t launch spacecraft outside their design specs.” Because the thing about Normalization of Deviance as a concept is that it applies to all sorts of engineering issues, and not just mechanical engineering!

Like, think about a road: You know it’s going to be a 50 MPH road, so you design it as such. You don’t put sharp turns in a road where people are going 50MPH, because you know if people try to take them at 70 MPH they’ll crash. And people always push the limits. So you build your “50 MPH” road knowing people might be going 70 MPH. You design your turns & signage for that range. And the road opens and it works perfectly at 50MPH.

But some people go 70MPH, which is fine, you planned for that. The police stop a few of them. But as people go on the road and get used to it, they start going 60 MPH, just cause they can and nothing bad seems to happen. The normal becomes 60 MPH. So now the averages have shifted. You designed for 50 (with a +20MPH safety range) and now most people are doing 60 MPH, and the ones going a little fast do 70 MPH, and the ones going Extra Fast do 80 MPH. And maybe that seems fine. The people going fast know the risks they’re taking so they pay extra attention (for police cars, if nothing else). And it’s fine, for a while.

Then it rains, and what was safe at 50 MPH, borderline at 70MPH, and risky at 80 MPH is now borderline at 50MPH and risky at 60MPH and deadly at 80MPH. And a bunch of people crash. And they crash because they normalized the “rules-in-practice” of “go 60, go 70 if in a hurry, go 80 if an emergency.”

My point with this is not to say “HEY PEOPLE STOP BENDING THE RULES,” exactly. It’s that you have to consider normalization of deviance when designing systems: How will these rules interact with how people naturally bend the rules?

Maybe you need to make these things explicit in your designs. Like “We can launch down to 40F based on our tests, but if we push that down to 30F we’ll need to do more research to make sure it’s safe in the long run.”
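
One way that could look in practice is a check that keeps “tested” and “we got away with it once” as separate answers instead of blurring them into one limit. This is only a sketch: the names are hypothetical, and the temperatures are the illustrative ones from this essay, not real shuttle launch criteria.

```python
from enum import Enum


class Verdict(Enum):
    GO = "within the tested range"
    NEEDS_REVIEW = "outside the tested range, inside what we've gotten away with"
    NO_GO = "outside anything tested or tried"


def check_launch_temp(temp_f, tested_min_f=40.0, tested_max_f=90.0,
                      coldest_flown_f=30.0):
    """Classify a launch temperature (illustrative numbers only)."""
    if tested_min_f <= temp_f <= tested_max_f:
        return Verdict.GO
    if coldest_flown_f <= temp_f < tested_min_f:
        # It worked before, but nobody tested it. That calls for a
        # review, not a quiet widening of the limit.
        return Verdict.NEEDS_REVIEW
    return Verdict.NO_GO


for temp in (60, 35, 28):
    print(temp, check_launch_temp(temp).name)
# prints: 60 GO / 35 NEEDS_REVIEW / 28 NO_GO (one per line)
```

The point of the middle verdict is that the deviance gets a name and a paper trail, so it can’t silently turn into the new normal.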

The really sad, scary thing is this kind of normalization of deviance problem didn’t just cost the space shuttle program one orbiter. It cost it TWO. Because 17 years after the Space Shuttle Challenger disintegrated on liftoff, the Space Shuttle Columbia broke up on re-entry.

It wasn’t the solid rocket boosters and their O-rings, it was the insulation on the external fuel tank. It had to be covered in insulation to prevent ice from forming on it, and damaging the tank. But on take-off, the foam often fell off. It was relatively lightweight and didn’t usually cause any problems when it struck other parts of the orbiter. It’d even happened before the Challenger disaster, back in 1983. It was just “foam shedding,” as they called it. A now normal part of launch, even though no one had ever planned for it to happen. And this didn’t cause a problem, the first 112 times they launched.

But on the 113th time, a chunk of foam the size of a suitcase hit the wing in a spot where they couldn’t afford to be hit. And it turns out that even relatively lightweight foam can make a big hole when it hits the wing while the orbiter is moving at Mach 2.46. And they made it into space just fine, completed their mission in space just fine, but when they tried to re-enter, wing-edge temperatures of 2500F caused the structural components to fail as hot air entered through the hole the foam block had made. The foam shedding thing was always a problem. It’d always been a danger to the orbiter, and had been there from the beginning. But they’d gotten lucky 112 times in a row. So they didn’t consider it a priority.

If they’d realized exactly how this could cause a complete mission failure, they might have prioritized finding a way to fix the foam shedding. But it’d never been a problem before, so there were always higher-priority issues.

And that’s an element everyone building anything should consider: Your system not breaking doesn’t mean it works and is a solid design. It might just mean you’ve gotten lucky, a lot, in a row.
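
If you want to see how little a long clean streak proves, here’s a quick Python sketch. The 1-in-200 failure chance is a number I made up purely for illustration (it is not an estimate of the shuttle’s real risk), and it assumes every attempt is independent with the same odds.

```python
def clean_streak_probability(p_failure, attempts):
    """Chance of zero failures in `attempts` independent tries,
    each with the same per-try failure probability."""
    return (1.0 - p_failure) ** attempts


# Hypothetical: a flaw that bites once in 200 tries, observed over 112 tries.
p = clean_streak_probability(p_failure=1 / 200, attempts=112)
print(f"chance of 112 clean runs in a row: {p:.0%}")
```

With those made-up odds you’d see 112 clean runs a bit more than half the time, so the streak tells you almost nothing about whether the flaw is there.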

A real-world way I’ve run into this: at a previous job we had some tests that ran on machines, and they had a step where they’d install some special tools to run the test with, then use them. It turned out the “did we install the tools right?” part was always skipped. But no one noticed, because usually the tools would install fine, and if they didn’t, we’d immediately fail as the test tools weren’t there (or were the wrong ones). So we’d get the expected success or the expected failure. Seems to work fine, right?

And then one day someone makes a change in some unrelated code to try to limit how much we re-initialize test machines. We’ll leave some files in place so we don’t have to reinstall them all the time. And that affected these tools, too. And everything seemed to continue to work. We’d happily install the tools over the old ones, and it’s fine. But then someone accidentally broke the tools with a bad commit… and we didn’t notice for weeks.

Why? Well, the tools were broken now, and wouldn’t compile and install. Which’d be fine and would have triggered failures, except we had that long-standing bug (that we didn’t know about) where the test would keep running even after a failed install. In the past this had always triggered a test failure anyway (due to missing tools), so we never had any issues with it. But now that we were keeping old files around, the test would still run, as it’d use the files still on the box from the last time the install had worked.

So we thought we were running the new tool code, we thought the tools were working, we thought the tests were fine. We were wrong. Nothing was working. But we were lucky, so it looked like it did.

In the end we only discovered this was happening because we tried to set up some new machines. They naturally didn’t have any tools from the last-run (because they’d never run before) so they were failing tests that “worked” elsewhere.
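
Here’s roughly what that bug looks like, boiled down to a toy Python simulation. None of this is the real harness: the class, the function names, and the version strings are all invented for illustration.

```python
class Machine:
    """A pretend test machine; `tools` is whatever tool version is on disk."""

    def __init__(self, tools=None):
        self.tools = tools


def install_tools(machine, build_ok, version):
    """Return True on success. On failure, leave the disk untouched."""
    if not build_ok:
        return False
    machine.tools = version
    return True


def run_test(machine):
    if machine.tools is None:
        return "FAIL: tools missing"
    return f"PASS (ran with tools {machine.tools})"


def buggy_job(machine, build_ok, version, wipe_first):
    if wipe_first:
        machine.tools = None  # the old behavior: re-initialize every time
    install_tools(machine, build_ok, version)  # result ignored: the bug
    return run_test(machine)


def fixed_job(machine, build_ok, version, wipe_first):
    if wipe_first:
        machine.tools = None
    if not install_tools(machine, build_ok, version):
        return "FAIL: tool install failed"
    return run_test(machine)


# Broken tools, machine wiped first: the bug is masked by the missing files.
print(buggy_job(Machine("v1"), build_ok=False, version="v2", wipe_first=True))
# prints: FAIL: tools missing

# Broken tools, old files kept: the test "passes" against stale tools.
print(buggy_job(Machine("v1"), build_ok=False, version="v2", wipe_first=False))
# prints: PASS (ran with tools v1)

# A brand-new machine has nothing stale to fall back on.
print(buggy_job(Machine(), build_ok=False, version="v2", wipe_first=False))
# prints: FAIL: tools missing

# Checking the install result fails loudly regardless of what is on disk.
print(fixed_job(Machine("v1"), build_ok=False, version="v2", wipe_first=False))
# prints: FAIL: tool install failed
```

The ignored return value was harmless only as long as something else happened to fail for it. Change that something else, and the luck runs out quietly.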

Another reason this went unnoticed was the size of the log files we generated and all the scary-sounding stuff that happened in them. We had built a system that generated thousands of lines of logs for every test, with lots of “failures” recorded in them. Things like “tried to initialize FOOBAR_CONTROLLER: FAILED!!!,” but we just ran that code on all machines, even ones without the FOOBAR_CONTROLLER hardware. So no one noticed when another 5 lines of errors popped up in a 2000-line log file.
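
A minimal Python sketch of one way to dig the new errors out of that noise: keep an explicit list of the failures you’ve decided are harmless, and surface everything else. The function name and the “tool install” log line are invented for illustration; only the FOOBAR_CONTROLLER line comes from the story above.

```python
def unexpected_failures(log_lines, expected_markers):
    """Return the FAILED lines that match no known-harmless marker."""
    return [
        line
        for line in log_lines
        if "FAILED" in line and not any(m in line for m in expected_markers)
    ]


log = [
    "tried to initialize FOOBAR_CONTROLLER: FAILED!!!",
    "copying files: ok",
    "tool install: FAILED",  # hypothetical new error hiding in the noise
]

print(unexpected_failures(log, expected_markers=["FOOBAR_CONTROLLER"]))
# prints: ['tool install: FAILED']
```

The better fix is probably not to log a failure for hardware the machine doesn’t have in the first place, but an explicit allowlist at least makes the “errors we ignore” a written rule instead of an unwritten one.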

Because here’s the thing: most of the time when there’s a Serious Problem™, it’s not just one event. Disasters aren’t caused by one small event. They’re an avalanche of problems, each of which we’d survived on its own, right up until they all happen at once.

Like, the Titanic disaster didn’t kill 1,500 people because they had a one-in-a-million chance of hitting an iceberg. Yeah, the iceberg was the linchpin in that disaster, but it’s just the final piece in that jigsaw.

If they hadn’t been going so fast, if the radio operator hadn’t been preoccupied, if the lookout’s binoculars hadn’t been missing, if it hadn’t been a moonless night, if they’d not had rivet problems, if the bulkheads went all the way up, if they had enough lifeboats … It might have been a minor enough incident that you wouldn’t have even heard of it.

Like, in 1907 the SS Kronprinz Wilhelm rammed an iceberg. It was a passenger liner (later a troop transport) and fully loaded would have over a thousand passengers and crew aboard. It survived. It completed its voyage and stayed in service for another 16 years.

You probably haven’t heard of this incident. It’s a single-line mention on a Wikipedia page. Because they didn’t hit all the failures at once. They rolled the same dice and didn’t come up all 1s.

Maybe they were going slower, maybe they had more lookouts, maybe they had better steel rivets, maybe they just happened to hit an iceberg on a full moon so they had more time to notice they were going to crash and could slow down more. I don’t know.

And the thing about this kind of thing is that these sorts of disasters aren’t just mechanical or natural. This happens to people, too. I was talking to a friend the other day about their situation and we talked about this exact thing: It’s never just one thing.

It’s not like you get yelled at online or a friend is having difficulty and you go from “doing fine” to “nearly suicidal” in one step. No, it happens when all these things accumulate and coincide.

Your friend is going through a hard time and you’re trying to help, and normally that’s fine, but it happens on the day when you’re getting over a cold and your roommate is yelling at the cat and you get an unexpected bill and your fiancee is out of town. Each of these things on their own (or maybe with one or two others) is not a huge problem. You don’t have a breakdown. You don’t have a panic attack. But sometimes the dice come up the wrong way and all of them happen at once.

And I think the moral of the story is that you shouldn’t feel bad about getting pushed over the edge by a “little thing,” nor should you get mad at people for not being able to handle “a little thing.” Because it’s usually not that someone wakes up on a perfectly fine day, healthy and happy, steps outside their door and gets hit by a car, and the day goes from GREAT to SHIT in one step. It’s usually lots of little things that accumulate. And you don’t realize each of them piling on until you reach that limit. You especially don’t realize it when it’s someone else hitting that limit!

So give people slack. Be understanding when you ask them to do one thing and they can’t get to it or it causes them stress that you don’t understand. They have other shit on their plate that you can’t see.

And that goes especially for YOU. Give yourself the benefit of the doubt on these things. Too often I see people being mean to themselves in a way they’d never treat anyone else. Be nice to you. You gotta live with you.

When you’re feeling mad at yourself or down on yourself, think about how you’d treat a friend in that situation. You probably wouldn’t go “you idiot, you can’t do anything right, why are you such a mess?” But it’s not uncommon for people to think that about themselves.

In any case, the only really helpful suggestion I can give for these kinds of “overload” problems: it’s fine to not address the one that “caused” the issue. It may be the one that pushed you over the edge, but that doesn’t mean it’s the easiest or most important to fix. If you can’t do X because A+B+C+D+E being on your plate has overloaded you, it doesn’t mean you have to directly attack X to fix it, or even the most recent problem (E). You can look at all the problems and find which can most easily be fixed.

Think of it like a video game inventory system. You found a gem and a rusty sword and a health potion, but now you found a key and you don’t have room in your backpack. You definitely need the key, but that doesn’t mean you have to break down and fail the mission. And it’s not the key’s fault that you got overloaded. It’s not even the potion’s fault, being the latest thing. You can look at all your problems and find the one to fix. Maybe that’s the rusty sword, freeing up a bunch of space in one move. Or maybe it’s just the gem: something small and lightweight, but it’ll free up just enough room for a key.

So maybe the answer is “ask your roommate to put the cat in the time-out room for now so the cat stops scratching them,” so you can handle making that phone call to the vet. Maybe you need to go to the pharmacy and get some cold meds.

My point is just that you can become overloaded like a video game character over their weight limit. When you have Too Much and it’s a problem, you don’t have to Just Bear It and inch your way back to the store to sell all your dungeon loot. When you’re overloaded, reshuffle what’s overloading you and find which ones you can relieve, even if it doesn’t seem to be a direct solution to your problem. Because you’ll have a lot better success in getting things done once you have some capacity to deal with things.

This sort of thing is sometimes called “spoon theory” when it’s related to disabilities. The basic idea is that you have some number of “spoons” you use up during the day on each thing you have to do. Disability means you have to use some of the spoons on the disability.

So you might have 5 spoons and you spend one on work, one on school, one on shopping, and have 2 free for anything else you have to do that day. But with disability you might be spending one every day just on the disability. And on a bad day, you have to spend two or three on it. And now the normal stuff you’d usually have time and energy for, you can’t do, because you’ve run out of spoons.

And it’s too easy to read “disability” and think “missing a leg” or “chronic illness like lupus.” Disabilities can be in your head just as easily, because that’s where YOU are. Depression is one. Anxiety, PTSD, ADHD, OCD … there’s plenty of illnesses that can use up spoons.

And maybe you think you’re doing fine, and your friends and coworkers think you’re fine, because you’ve got that spoon to afford on your disability. You compensate, and it works out. And then it’s a day where everything else is going wrong for random chance reasons and now you can’t afford that spoon and it seems like you’re failing and having a breakdown. It doesn’t mean your illness wasn’t there until then and just suddenly affected you! It just means you reached the point where you couldn’t afford your coping mechanisms, because you were overloaded.

It reminds me of how people with retina damage from lasers can have lots of it without it seeming to affect them very much. Sometimes they don’t even notice it, because the eye already has a big blind spot and the visual system works hard to make it seem like it’s not there. It fills in the blank bit. You get too close to a laser without proper eye protection, and now you have retina damage, but what’s one more hole to cover up? So your vision fills in the gap, and then you get more damage, and more, and it keeps filling in, but the total amount you can see is slowly going down, and your vision is worsening. Eventually your vision can’t compensate.

And it’s the same thing with mental illnesses: You cope. You spend spoons on making up for the problems they cause. You may stay functional … but you are spending spoons. You don’t have an unlimited budget.

So think about your workload (and by “work” I don’t just mean the 9-5 money-making sort of work). You have limits. And it’s not a bad thing when you have to cut back, when you have to relax, when you have to take time to heal. Because it often seems to be the nature of how we normalize what we’re successfully doing to keep pushing ourselves and not realize how close we are to being overloaded.

There’s nothing wrong with trying to avoid that point, and there’s especially nothing wrong with having to cut back on what you can do once you do hit that point. If you try to load 9 boxes in your car and only 7 will fit, you don’t get mad at the car for not “toughing it out.” You’re a machine with limits too. Those limits are different because you’re conscious and biological rather than computerized and mechanical, but you’ve still got limits. Keep that in mind.