It's about what broke, not who broke it

I used to teach a class to new people who had just joined the company. Over the span of almost four years, it moved around a bit in terms of which week I'd see them (third, first, then second), but the class always focused on the same thing: troubleshooting and outages.

I got into it almost by accident. One of my teammates had created it, and had a slide deck and a bunch of interesting topics to share. He would teach it about once every two weeks. Then, one summer, he went on vacation and needed coverage, so I offered to take it for him. It turned out that I enjoyed it so much that I asked to do more, even when he was available.

Over the months and years that followed, it acquired a life of its own.

While there are details I'll probably never be able to share, there are ways of talking about what went on in that class. For starters, I would usually mic myself up a few minutes early, and would chat with the people who had arrived before the start time.

I'd ask them "who here has heard the rumor that (something that seems completely outlandish including who supposedly did it) happened, and it took down (way more things than you'd imagine)"? Some hands would go up, some people would murmur. Then I'd say "is it true or false?", and it was always interesting to see which way the room would go. Depending on what they had heard and how talkative they were, it might go either way.

I'd then say "it's true" and later, during the actual class, would tell them the story of a time someone did something innocent, tripped over a three-year-old bug, and managed to effectively unplug everything.

I then had to tell them that this person still worked there. I'd say that I would not tell them who it was, and that it didn't matter. It wasn't that person's fault that something broke. They managed to find something that had been written into the code years before they probably ever thought of joining the company, and were doing something that should have worked when it fell over.

The important thing at the time was that we cared about what broke, not who broke it. Who broke it is frequently just a roll of the dice: who got that particular task, bug, or ticket assigned to them, and happened to run this valid command instead of that also-valid command? Why would you ever assign blame based on that?

If anything, you'd want to find out the general pattern of what had broken and then go scour the code base to see if it had happened anywhere else. Chances are, someone didn't just come up with that particular string-mangling "wizardry" out of thin air, and they either picked it up from some other part of the code, or someone else later copied from them and did the same stuff in their own code.

Maybe they never got the memo that you're really not supposed to be doing old-school C-style char* manipulation with [] accesses and pointer math and all of this in C++ code that isn't a particular "hot spot" in the system. There are reasons that we like using actual strings and things with bounds-checking, right?

In any case, however it got there, it has to be found and eradicated in the code. Then it has to actually get built and pushed. (Far too many outages occur because the fix is only "in trunk" and never got pushed in time.) After that, the follow-up involves some kind of static code analysis, lint rules, or whatever else is necessary to positively keep people from putting it back in.
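As a rough idea of what the "keep it from coming back" step can look like, here's a deliberately crude sketch of a lint-style check. It's just a substring scan over source text, and the ban list, the Finding struct, and scan_source are all hypothetical names I made up for this example. A real setup would more likely lean on an existing analyzer such as clang-tidy wired into the build or code review.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One hit from the scan (hypothetical structure for this sketch).
struct Finding {
    std::size_t line;     // 1-based line number in the scanned source
    std::string pattern;  // which banned pattern matched
};

// Scans `source` line by line and reports every line containing one of the
// `banned` substrings. This is plain text matching, so it will also flag
// comments and look-alike names. It is a tripwire, not a parser.
std::vector<Finding> scan_source(const std::string& source,
                                 const std::vector<std::string>& banned) {
    std::vector<Finding> findings;
    std::istringstream in(source);
    std::string text;
    std::size_t line_no = 0;
    while (std::getline(in, text)) {
        ++line_no;
        for (const std::string& pattern : banned) {
            assert(!pattern.empty());  // an empty pattern would match every line
            if (text.find(pattern) != std::string::npos) {
                findings.push_back({line_no, pattern});
            }
        }
    }
    return findings;
}
```

The idea would be to run something like this (or the real tool) on every change and refuse the commit when the list of findings isn't empty. That way the rule lives in the system instead of in people's memories.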

Teaching is not enough. There are far too many people going through the revolving doors of these companies to ever think that you could possibly get them all and keep it all fresh in their heads. You have to design a system such that the natural thing to do yields a good result and doesn't put anyone in harm's way.

"What broke, not who broke it" is another one of those cultural touchstones within a technical environment. You should keep an eye on it and see if it's still being honored, or if it starts being ignored. When things change, be prepared to change with them, or be prepared to suffer.