Some items from my "reliability list"

It should not be surprising that patterns start to emerge after you've dealt with enough failures in a given domain. I've had an informal list bouncing around inside my head for years. Now and then, something new to me will pop up, and that'll mesh up with some other recollections, and sometimes that yields another entry.

One day I took a whack at getting it into text form. The items that can apply outside of a given company might be interesting. As for the ones that keep coming up, well, I guess you can join me in facepalming a little when it happens again.

I'll list some of them here and some of the thinking behind them. Just about everything here has happened at some point in time, and probably has happened more than once... way more than once.

Item: Rollbacks need to be possible

This one sounds simple until you realize someone's violated it. It means, in short: if you're on version 20, and then start pushing version 21, and for some reason can't go back to version 20, you've failed. You took some shortcut, or forgot about going from A to AB to B (shipping an intermediate version that handles both the old and the new behavior before you cut over), or did break-before-make, or any number of other things.

It doesn't really matter HOW it was violated, just that it was. If you keep the general premise of "how would we roll back" in mind when designing and reviewing something, perhaps it would not happen so often.

Item: New states (enums) need to be forward compatible

Rollback compatibility isn't just about the code. What about the data emitted by something? Let's say you have a complicated state machine, and you add another state in version 21. It starts writing out records using that state, and that state didn't even exist before. If any version 20 instances ingest it, they're going to blow up.

So guess what happens if you try to roll back to 20? That's right, you quickly find out you can't. This puts you between a rock and a hard place in terms of dealing with a disaster if 21 should fail. If something bad happens, you now have to come up with a "fix forward" approach and hope that nothing else wrong is baked into that code.

Otherwise, you get to play "hand hack the state storage to fix the records that got ahead of what version 20 can handle".

This is kind of another A, AB, B type of thing, and it crosses paths with Postel's Law (be conservative in what you send and liberal in what you accept). In this case, you need to make sure you can recognize the new value and not explode just from seeing it, then get that shipped everywhere. Then you have to be able to do something reasonable when it occurs, and get that shipped everywhere. Finally, you can flip whatever flag lets you start actually emitting that new value and see what happens.
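Here's a rough sketch of what those three steps might look like in Python. Everything named here is made up for illustration: the states, the UNRECOGNIZED catch-all, the "parked" and "held" outcomes, and the emit_paused flag. Treat it as one way to do it under those assumptions, not the way.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    PAUSED = "paused"  # hypothetical new state
    UNRECOGNIZED = "__unrecognized__"  # catch-all, never written out


def parse_state(raw):
    """Step 1: seeing a value this build doesn't know must not blow up."""
    try:
        return State(raw)
    except ValueError:
        return State.UNRECOGNIZED


def handle(record):
    """Step 2: do something reasonable with the new value when it shows up."""
    state = parse_state(record.get("state"))
    if state is State.UNRECOGNIZED:
        return "parked"  # set it aside for a newer build or a human
    if state is State.PAUSED:
        return "held"
    return "processed"


def state_to_write(wants_pause, emit_paused=False):
    """Step 3: only emit the new value once the flag has been flipped."""
    if wants_pause and emit_paused:
        return State.PAUSED.value
    return State.RUNNING.value  # what the old version would have written
```

The point of the flag defaulting to off is that you can ship this build everywhere, and roll it back, without a single "paused" record ever hitting storage. Only after every reader can cope with it do you turn it on.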

Item: more than one person should be able to ship a given binary.

Hopefully this one gets people to stop and think "what do you mean, only one person, that would mean..." and then their thoughts spiral off into bad places and hopefully they start seeing the weak spots in their organizations.

This one is pretty straightforward: if only one person can ship a binary, what happens when they're not available? You get to work backwards on this one, too. How did you get into this state? This one is fun because it's both a tech question and a management question. Neither the coders nor the managers should have let this happen.

Hopefully you find this one out before you run into a situation where a critical bug fix sat in trunk for months and took down your business because someone was on vacation and the fix never went out.

Item: using weak or ambiguous formats for storage will get us in trouble.

Translation: if you aren't using something with a solid schema that handles field names and types and gives you some way to find out when it's not even there, you're going to be in a heap of trouble. It might look fine initially, but once time has passed and a few more programmers have rotated through, you'll be in a world of hurt.
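To make "field names and types and some way to find out when it's not even there" concrete, here's a hand-rolled Python stand-in. The Account record, its fields, and decode_account are hypothetical, and a real schema system would generate this sort of checking for you instead of making you write it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    account_id: int
    email: Optional[str] = None  # None means "absent", which is not the same as ""


# Hypothetical schema: field name -> expected type, plus the required set.
FIELD_TYPES = {"account_id": int, "email": str}
REQUIRED = {"account_id"}


def decode_account(raw):
    """Check names and types up front; unknown extra fields are ignored."""
    missing = REQUIRED - set(raw)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    values = {}
    for name, expected in FIELD_TYPES.items():
        if name not in raw:
            continue  # left at its default so callers can tell it was absent
        if type(raw[name]) is not expected:
            raise TypeError(f"{name} should be {expected.__name__}")
        values[name] = raw[name]
    return Account(**values)
```

With this, decode_account({"account_id": 7}) gives you an email of None, while an explicit empty string stays an empty string, and a string where the integer should be fails at the door instead of three services later.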

Protobuf, Thrift, whatever, just use something like that. And, that leads me into the next one...

Item: there is far, far too much JSON at $company.

In the original, "$company" was the actual name of the company where I first put this into writing. The problem is, this holds true in far more places than that. JSON is not what you want when you're blasting data around and web browsers are not involved at either end. Service A talks to service B? Great! See the previous item, and use one of those.

On the other hand, if you only need 53 bits of your 64-bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.
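If the 53 bits thing doesn't ring a bell: plenty of JSON consumers store every number as a double, and a double only has 53 bits of integer precision. Here's a small Python demonstration. Python's own decoder keeps integers exact, so the parse_int=float argument is there to imitate a double-only consumer. The "id" field is just an example.

```python
import json

big_id = 2**53 + 1  # 9007199254740993, the first integer a double can't hold
wire = json.dumps({"id": big_id})

# Python's decoder uses arbitrary-precision ints, so this survives intact.
exact = json.loads(wire)["id"]

# Imitate a consumer that parses every JSON number into a double.
as_double = json.loads(wire, parse_int=float)["id"]
lossy = int(as_double)

print(exact == big_id)  # True
print(lossy == big_id)  # False: the value came back as 2**53
```

Nothing errors out here. The ID just quietly turns into a different ID, which is the worst kind of failure.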

Item: if one of our systems emits something, and another one of our systems can't consume it, that's an error, even if the consumer is running on someone's cell phone far far away.

Again, you might guess which company this came from, but it holds true everywhere else, too. If you wrote the server, and it sends out some data, and you wrote the client, and that client can't ingest it, that's a problem! It's not suddenly OK because the client is actually running on some random person's device out in the world.

Sure, some tiny fraction of users are actively fuzzing your client, trying to see what features are in it that haven't shipped yet, but everyone else is just trying to USE it normally. If you throw out a blanket decree to not investigate these things just because someone might be fuzzing it, that's just lazy.

Incidentally, this applies even more forcefully between internal services. If two parts of the company are talking to each other, and there's anything but a success for that request, something's wrong. An "HTTP 400 Bad Request" is only the sort of thing you get to ignore when it's coming from the great unwashed masses who are lobbing random crazy crap at you, trying to break you. Internally, that's something entirely different.

(Also, if you are literally having HTTP 400s internally, why aren't you using some kind of actual RPC mechanism? Do you like pain?)

That's enough for now.

August 1, 2019: This post has an update