Taking too much slack out of the rubber band

Of late, I keep hearing more and more about elastic this and elastic that. People use it to "right-size" their services, and only run the number of instances, pods, or whatever that their service needs at that moment. They do this instead of the old way, where you sized for the most traffic you could ever need to handle, added some safety margin, and then just didn't use all of it all of the time.

There are problems with this, though. Just like you lose some energy to heat whenever you move things around, you introduce costs when something becomes elastically sized. You have to be able to get accurate signals of when a "scale up" is needed, and then you have to be able to actually deliver on that scaling before you overrun your existing capacity.

The more penny-pinching you do up front by running it close to the line (and needing to "scale up" all the time), the more dollars you'll probably lose on the back end when the scaling can't happen fast enough.

Here's a very real scenario.

Let's say you have your business, and it relies on a whole bunch of services running all over the place. Maybe you've been tracking the shiny thing of the month, and so are currently embroiled in the microservices thing. Whatever. However you got here, you're here now. You're doing lots of work in tiny increments, and you are highly reliant on scaling responsively, in time to handle incoming load spikes.

Now let's say something breaks. Maybe one of your vendors breaks one of those critical 20 services you depend on. This nukes an entire service on your side, and that in turn drives a wedge into the machine that is your business. The entire thing shudders to a stop, and people start running around trying to fix it. Then when they realize it's the vendor, they start yelling at THEM, trying to get them to fix it.

Meanwhile, this local service has no work to do, and it's just chilling out. It's getting requests, sure, but it's throwing them away. Nothing makes it past this particular boulder, since it's an essential part of the rest of the process.

What else is happening? Well, further down the "flow" for the business pipeline, other services have seen their request rate dry up. Since nothing is getting past the big boulder service, the things downstream have nothing to do. Their CPU utilization goes down to almost nothing.

You know what happens next, right? The downscaler kicks in, and a bunch of those instances disappear. After all, they aren't busy, and they don't need to exist, so let's get rid of them and save some money!

Of course, at some point, the vendor fixes their brokenness, and then the local service gets restarted by the ops team (because it can't self-heal, naturally), and traffic starts flowing deeper into the business. Well, it's still prime time, and guess what? All of the later bits have scaled down to record-low levels, like "everyone on Earth was abducted by aliens and stopped using the service" levels. They are just NOT ready to handle the incoming load. Worse still, they can't even deal with part of it, since being overloaded breaks them so badly that they can't even pick off a few requests and throw the rest away.

The net result is that a couple of initial requests get through for a couple of seconds, then everything else grinds to a halt again as the first scaled-down service is hit by the swarm of locusts. The CPU shoots up, and this kicks off a scaling event -- assuming, of course, that nothing is preventing it from scaling back up because "you just scaled down a few minutes ago". This up-scaling is not a quick thing, and it may be on the order of minutes, or tens of minutes, before it's back where it needs to be.

Of course, with that service restored, the NEXT one down the line starts getting traffic, and it's also down-scaled, and... yeah. You get the idea. It goes on like this until you run out of things to boil, everything else is scaled back up to where it was before (or more, to handle the queued requests), and can survive.

This turned a short vendor-driven outage into a much, much longer locally-driven one.

What do you do about this? How about a big red button, for starters. If anything crazy happens, hit it, and don't let existing instances/pods/jobs/allocs/whatever get reaped. Maybe you'll need them later. Of course, that means you need a human at the conn forever, and that's not great. Maybe you could have some kind of auto-trigger for it.
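Here's a rough sketch of what that button might look like, assuming a hypothetical autoscaler that asks a guard object before it changes the instance count. Everything in it (the `ScaleDownGuard` name, the `press`/`observe`/`allowed` methods, the 50% drop threshold) is made up for illustration and doesn't come from any real autoscaler.

```python
class ScaleDownGuard:
    """Hypothetical 'big red button': while pressed, nothing gets reaped."""

    def __init__(self, drop_ratio=0.5):
        # Assumed auto-trigger threshold: trip when traffic falls below
        # this fraction of the previous sample. 0.5 is arbitrary.
        self.drop_ratio = drop_ratio
        self.frozen = False
        self._last_rps = None

    def press(self):
        """A human hits the button."""
        self.frozen = True

    def release(self):
        """A human decides things are back to normal."""
        self.frozen = False

    def observe(self, rps):
        """Auto-trigger: freeze if traffic fell off a cliff since last sample."""
        if self._last_rps is not None and self._last_rps > 0:
            if rps < self._last_rps * self.drop_ratio:
                self.frozen = True
        self._last_rps = rps

    def allowed(self, current, desired):
        """Return the instance count the autoscaler may actually apply."""
        if self.frozen and desired < current:
            return current  # keep what we have; we may need it later
        return desired      # scale-ups are never blocked


guard = ScaleDownGuard()
guard.observe(1000.0)
guard.observe(20.0)                        # traffic just fell off a cliff
kept = guard.allowed(current=40, desired=2)  # scale-down is refused
```

Note that in this sketch only the freeze is automatic. Letting go of the button is still a human call, since "is the outage really over?" is exactly the kind of thing a CPU graph can't tell you.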

For another thing, how about knowing approximately what the traffic is supposed to look like for that time of day, on that day of the week and/or year? Then, don't down-scale past that by default?
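As a sketch of that idea, assume you have some history of typical requests per second keyed by day of week and hour, plus a rough idea of how much one instance can handle. The table, the function names, and the numbers below are all hypothetical. Real values would come from your own metrics.

```python
import math
from datetime import datetime

# Hypothetical history: typical requests/sec keyed by (weekday, hour),
# where weekday 0 is Monday. These numbers are invented for the example.
EXPECTED_RPS = {
    (0, 20): 1200.0,  # Monday evening, prime time
    (0, 3): 150.0,    # Monday, middle of the night
}


def floor_instances(when, rps_per_instance, expected=EXPECTED_RPS):
    """Smallest instance count that could carry the usual load at `when`."""
    rps = expected.get((when.weekday(), when.hour), 0.0)
    return max(1, math.ceil(rps / rps_per_instance))


def clamp_scale_down(desired, when, rps_per_instance):
    """Never go below the floor implied by normal traffic for this hour."""
    return max(desired, floor_instances(when, rps_per_instance))


# Monday at 20:00, CPU is idle because upstream is broken, and the
# autoscaler wants 2 instances. The floor says otherwise.
prime_time = datetime(2024, 1, 1, 20)
count = clamp_scale_down(desired=2, when=prime_time, rps_per_instance=50.0)
```

The "by default" part matters: you'd still want a way to override the floor when traffic really has changed for good, or the table will keep you pinned at last year's numbers.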

Or, how about some kind of second-order thing on the rate of decrease, such that it's not just a "divide this by that and that's what we do right now" number? Put some kind of sliding window on it. Of course, that just turns an instant outage into one that only happens after you've been down for a while and are already looking really bad in the press, so watch out for that, too.
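One way to read the sliding-window idea, sketched under my own assumptions: remember the last few recommendations, scale up immediately, but only scale down as far as the highest recommendation still in the window. The `SlidingWindowScaler` class and its window size are hypothetical, and this is a plain windowed maximum rather than a true rate-of-change calculation.

```python
from collections import deque


class SlidingWindowScaler:
    """Hypothetical damper: scale-downs lag behind by `window` samples."""

    def __init__(self, window=3):
        # Only the most recent `window` recommendations are remembered.
        self.recent = deque(maxlen=window)

    def decide(self, current, desired):
        """Return the instance count to run after this sample."""
        self.recent.append(desired)
        if desired >= current:
            return desired  # scale up right away
        # Scale down only to the highest recommendation still in the window.
        return min(current, max(self.recent))


scaler = SlidingWindowScaler(window=3)
history = []
current = 40
for desired in [40, 2, 2, 2]:   # upstream breaks after the first sample
    current = scaler.decide(current, desired)
    history.append(current)
# history is [40, 40, 40, 2]: the drop still happens, just three samples late
```

That last entry in `history` is the catch described above: the window buys time, but if the outage outlasts it, the instances go away anyway.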

Finally, if your business prints money, you could just always have the right number of $whatevers running, and just accept that they won't be as busy all of the time. It's not like you have your own duck curve, right? Look at it another way: when you're running below capacity, you're hopefully running faster. Faster services make for happier users, because people are impatient.

Capacity engineering is no joke.

November 11, 2019: This post has an update