Vertical architecture

There’s one final topic that’s worth discussing.

Sometimes I tell the above to people and they say, “That’s great, but don’t microservices already give us this? What’s the big deal?” It’s hard to explain why I find polyglot programming compelling without also explaining why I find microservices architectures in need of competition.

Firstly, yes, there are times when you need lots of servers running lots of services working together. I spent over 7 years at Google and worked with their Borg container orchestrator almost every day. I wrote ‘microservices’ although we didn’t call them that, and I consumed them. There was no alternative because our workloads required thousands of machines!

But these architectures come with heavy costs:

- Serialisation. It imposes a performance penalty, but more importantly it requires you to constantly flatten your (at least somewhat) typed and optimised data structures into simple trees. If you use JSON you lose the ability to do basic things, like having lots of small objects point at a few large shared objects: the shared objects get duplicated on the wire unless you build custom indexes to avoid the repetition (see the sketch after this list).
- Versioning. This is hard. Universities often don’t teach this kind of difficult-but-mundane software engineering discipline, so even if you think you’ve totally nailed the difference between forwards and backwards compatibility, and even if you’re sure you understand multi-phase rollouts, can you guarantee everyone who replaces you will? Are you properly integration testing the different version combinations that can occur during a non-atomic rollout? A significant number of the disasters I’ve seen in distributed architectures boiled down to version skew.
- Consistency. Atomic operations inside a single server are quite easy. Making sure users always see a fully consistent view when multiple machines get into the mix, especially if data is sharded amongst them, gets a lot harder. That’s why relational database engines historically didn’t scale well. Again, even if you’re sure you’ve got it 100% nailed, are you sure everyone you hire will, stretching into the future? I’ll give you a hint: Google’s top engineers have spent decades trying to simplify distributed programming for their teams by making it look more like traditional programming.
- Reimplementation. Because RPCs are expensive you can’t do many of them, so for some kinds of task you have no choice but to reimplement code. Google has some libraries that can be used from any language by making RPCs to them, but others had to be recoded from scratch in each language.
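To make the serialisation point concrete, here’s a minimal Java sketch (the Customer and Order types are made up for illustration, and it assumes the Jackson library on a records-capable JDK): two orders that share one customer in memory become two independent copies in the JSON tree, and the shared reference can’t be expressed without inventing a custom ID-based index.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

// Hypothetical domain types, purely for illustration.
record Customer(String id, String name) {}
record Order(String orderId, Customer customer) {}

public class AliasingDemo {
    public static void main(String[] args) throws Exception {
        Customer alice = new Customer("c1", "Alice");

        // In memory: two orders point at ONE shared Customer object.
        List<Order> orders = List.of(new Order("o1", alice), new Order("o2", alice));

        // On the wire: JSON is a tree, so the shared object is duplicated.
        // Round-tripping yields two distinct Customer instances; the only
        // way around it is a hand-rolled index of IDs to objects.
        String json = new ObjectMapper().writeValueAsString(orders);
        System.out.println(json);
        // [{"orderId":"o1","customer":{"id":"c1","name":"Alice"}},
        //  {"orderId":"o2","customer":{"id":"c1","name":"Alice"}}]
    }
}
```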

So what’s the alternative?

Put simply, really big iron. This may sound absurdly retro but consider that the cost of hardware falls constantly, many workloads are not ‘web scale’ and your intuitions about what is reasonable may be out of date.

Consider an entry from a recent price list at a Canadian vendor: a machine with 40 cores, a terabyte of RAM and nearly a terabyte of disk goes for about $6,000 these days. The average salary for a software engineer in New York is about $132,000, which works out to roughly $2,500 a week, so one of these machines costs little more than two weeks of that person’s time (and less once you count the full overheads of employment). Think about how much time your team will spend over the lifetime of your project on distributed systems issues, and what that will cost.

Yeah, but isn’t everything web scale these days?

In a word, no.

The world is full of companies characterised by the following attributes:

- They’re in stable markets.
- They charge money for things. As a consequence their customer base is somewhere between the low tens and the tens of millions, not billions.
- Their datasets are mostly about their own customers and products.

A good example of this kind of company would be a bank. Banks do not experience ‘hypergrowth’; they don’t go viral. Their growth is modest and predictable, assuming they’re growing at all (banks are regional and usually in saturated markets). The largest bank in the USA has on the order of 50 million users of its website and it’s not going to double within six months. This isn’t Instagram we’re talking about. So it’s not entirely surprising that many banks still have a mainframe somewhere at their core. Of course the same is true of shipping firms, manufacturing firms, etc. The bread and butter of our economy.

In these sorts of businesses it’s plausible that for any particular app their needs fit entirely on a single big machine and always will. Heck, even public websites that give their product away can fit these days. In his entertaining 2015 talk “The website obesity crisis”, Maciej Cegłowski observes that whilst his own self-hosted bookmarking website was profitable, a competitor hosted on AWS was unprofitable purely due to differing hardware costs and complexity assumptions. In “Scaling Up vs Scaling Out” it was revealed that PlentyOfFish was running on roughly one megaserver (the article dates from 2009, so ignore the quoted hardware prices); the author does some calculations and shows it’s not as dumb as it sounds. Finally, in case you’re thinking about Hadoop and Big Data, a Microsoft Research paper from 2013 shows that many Hadoop workloads from Microsoft, Yahoo and Facebook actually run much faster and more efficiently on one big machine than on a cluster. And that was 6 years ago! The economics have probably shifted even more in favour of scale-up since then.

But the real savings don’t come from hardware costs. The real savings come from optimising the ultra-expensive engineering time it takes to build lots of tiny microservices scaling horizontally with elastic demand management. That type of engineering is risky and time-consuming even if you use all the latest toys in the cloud. You might lose SQL, solid transactions and unified profiling, and you’ll for sure lose things like cross-system stack traces. Type safety will vanish every time you cross a server boundary. You’ll get function calls that can time out (as the sketch below illustrates), redundant JIT compiler overheads, unexpected loss of backpressure, complex orchestration engines with bizarre pseudo-programmable config formats and … oh my, this really brings back the memories. It was fun to work on these systems when I had Google’s proprietary infrastructure and large engineering budgets to play with, but I’d only do it these days if I had no other choice.
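Here’s a minimal, self-contained Java sketch of that last point (the BillingService type is hypothetical, and a Future stands in for a real RPC framework): the moment a call crosses a network boundary it acquires a deadline and a failure mode that a plain method call simply doesn’t have.

```java
import java.util.concurrent.*;

// Hypothetical service, purely for illustration.
class BillingService {
    String createInvoice(String orderId) { return "invoice-for-" + orderId; }
}

public class RpcVsLocal {
    public static void main(String[] args) throws Exception {
        BillingService billing = new BillingService();

        // In-process: a plain method call. It can't time out, the result keeps
        // its static type, and any exception carries one complete stack trace.
        String invoice = billing.createInvoice("o1");

        // Across a network boundary the 'same' call needs a deadline and
        // explicit failure handling: simulated here with a Future.
        ExecutorService network = Executors.newSingleThreadExecutor();
        Future<String> remote = network.submit(() -> billing.createInvoice("o1"));
        try {
            invoice = remote.get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Did the remote side do the work before the deadline? You can't tell.
        } finally {
            network.shutdown();
        }
        System.out.println(invoice);
    }
}
```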

It used to be infeasible to run very large garbage collected servers because GC technology wasn’t good enough, so this topic has been academic for long periods of time: you were gonna run multiple servers no matter what, so you may as well embrace it. But with the arrival of ZGC and Shenandoah, terabyte+ heaps spread across 80 cores in a single process become entirely reasonable: your users won’t notice any GC hiccups. Buy a few big boxes, run your business logic on one and a database server on another, and see how far you can get.
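For the curious, a hedged sketch of what that looks like on a JDK of that era (both collectors shipped as experimental at first, hence the unlock flag; the jar name is made up):

```sh
# A 1 TB heap on ZGC; experimental before JDK 15, hence the unlock flag.
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx1024g -jar server.jar

# Shenandoah is the equivalent switch on JDK builds that include it.
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx1024g -jar server.jar
```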

So do you really need all your microservices? Or would a careful cost/benefit analysis reveal potential simplifications? Vertical architecture is about bringing old school back: saving money, reusing code and accelerating your team’s velocity by combining the latest cross-language compiler technology with the latest hardware … in traditional ways.