Photo by Mikes Photos from Pexels

Maybe because my father is a car mechanic (even now at the age of 70), I’m a strong believer in knowing what’s happening under the hood. I’m not talking about these newfangled cars with just a battery and motor. He’s too old for that. I’m talking about those fossil-fuel-burning vehicles with their dozens of complex systems, each comprising tens or hundreds of smaller moving parts.

In my less-physical world of computer systems, knowing what’s happening under the hood has several benefits. These include knowing when something doesn’t look right, knowing the right questions to ask when filing a support ticket, or simply recognizing whether those providing a service are doing a good job. Gaining a deeper understanding of what goes on under the hood of Kubernetes was just one of the reasons we decided to roll our own Kubernetes clusters instead of going with a hosted solution, such as Amazon’s EKS or Google’s GKE.

Hosted Solutions

We’re a startup and our Platform Team is just a handful of merry bandits. While the intro to this blog post might have led you to think we’re anti-hosted solution, it’s actually the opposite. Wherever possible we’ve opted for hosted solutions. There just isn’t enough time in the day for patching, rebooting and figuring out why logrotate isn’t working on twenty different services across hundreds of machines. Amazon Web Services, our platform of choice, provides some great hosted, but more importantly resilient, solutions for the bulk of our needs.

For instance, for all our databases we use AWS RDS with its Multi-AZ fail-over capabilities, its encrypted-at-rest storage and its ease of backup, clone or resize. We’re in the process of moving our Ansible-crafted ActiveMQ servers to the hosted solution Amazon MQ. Even when we’re not using hosted solutions, we still utilize services such as AWS Auto-Scaling Groups. This gives us resilience and peace-of-mind. Terraform also plays a big part.

As we move all our services to Kubernetes, we’re migrating off of another Amazon hosted solution, Elastic Beanstalk.

So why not hosted Kubernetes?

Not all hosted solutions are created equal

When looking at hosted solutions, the question you want to ask is “Can I trust this?” Has it stood the test of time and evolved over a long enough period with a large enough user base? Amazon EC2 and RDS are a couple that would obviously make this list.

On the other end of the spectrum is something like Amazon MQ. It’s pretty fresh. So why are we adopting this one? Well, it’s also pretty simple. It’s an instance of a single process of ActiveMQ running on an EC2 machine that shares its datastore with a second instance of ActiveMQ running on a second EC2 machine. They don’t even share DNS, so fail-over is mostly on the client to handle. This is the way AWS often rolls out new products. They start with the bare minimum, get feedback, iterate and grow features and complexity over time. It’s the classic lean approach. Yes, sometimes the lack of features is frustrating, but uptime wins in this game.
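To make “fail-over is mostly on the client to handle” concrete, here is a minimal Python sketch of the idea, not our actual client code. The endpoint names and the `connect` callable are hypothetical stand-ins; in practice ActiveMQ clients usually lean on the broker library’s own failover support rather than hand-rolling this.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(endpoints: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each broker endpoint in order and return the first connection that succeeds.

    `connect` is whatever function opens a connection to a single broker
    (hypothetical here); it is expected to raise OSError on failure.
    """
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return connect(endpoint)
        except OSError as exc:
            # Remember why this broker failed, then move on to the next one.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all brokers unreachable: " + "; ".join(errors))


if __name__ == "__main__":
    # Hypothetical endpoints: the two brokers have separate DNS names.
    brokers = ["broker-1.example.internal:61617", "broker-2.example.internal:61617"]

    def fake_connect(endpoint: str) -> str:
        # Pretend the first broker is currently the passive one.
        if endpoint.startswith("broker-1"):
            raise OSError("connection refused")
        return f"connected to {endpoint}"

    print(connect_with_failover(brokers, fake_connect))
```

The point is only that the client must know about both endpoints up front, because there is no shared DNS name that follows the active broker.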

Kubernetes ain’t so simple. Providing “the bare minimum” of Kubernetes is hard. Like the gas-guzzling engine I mentioned earlier, there are a lot of moving parts made up of a lot more smaller moving parts. Google and Microsoft already had hosted Kubernetes solutions available when AWS was leading up to its EKS launch. The prospect of placing our faith in AWS’s late-to-market, unusually complex first offering was starting to make us nervous.

Wait or jump?

We’d been sketching out our Kubernetes plan for several months and at that point AWS EKS still seemed like the obvious choice for us. This was despite Google’s hosted Kubernetes solution, GKE, being available and getting great reviews. Moving from AWS to Google was an option, but we had little other incentive to make such a dramatic change to the rest of our infrastructure.

As the months rolled by, we hadn’t managed to get on the AWS EKS closed beta. Our options were to keep waiting, start playing with GKE (investing in a platform we wouldn’t use) or roll our own on AWS and start rolling things forward.

Kubris

From going to DevOpsDays and other events, we started to realize just how frightening the prospect of running a Kubernetes cluster could be. Was Elastic Beanstalk really that bad? Were we bringing unwarranted risk to a blooming startup? The risks seemed to be there, regardless of whether you rolled your own or went with a hosted solution.

Kubris, n.: the quality of extreme or foolish pride or dangerous overconfidence in one’s Kubernetes skills. See also ‘clustastrophe’.

— John Arundel

At one unconference session I heard someone say that you needed a team of 16 people to run your own Kubernetes. Minimum. That was our entire Engineering department!

Regardless, instead of cowering in the corner pooping our pants, we decided to go forth, assume the worst would happen and learn what we could. The best way to maximize learning would be to roll our own, with the plan to switch to a hosted solution when AWS EKS became available to us.

Bring in the best

The world is a big place with a lot of Kubernetes experts, so I don’t know if Luke was “the best”, but he was damn good and exactly what we needed. Luke Kysow was an ex-Hootsuite employee that we brought in on contract for a few months to help us quickly get up to speed with Kubernetes.

We worked closely, doing everything with Terraform and closely reviewing and discussing each pull request. We chose low risk services to move to Kubernetes and designed a process for graceful zero-downtime migration of these services from Elastic Beanstalk to Kubernetes. More importantly, we ensured the migration could be reverted very quickly.

Side note: We use kops to generate Terraform code and then apply that, rather than using kops to manage the cluster directly.

Rollback

I won’t get into the technical details, but at a high level we relied a lot on low-TTL weighted DNS to phase over traffic between Elastic Beanstalk and Kubernetes. ActiveMQ traffic was a little more challenging. As time passed and confidence grew, we would scale down more of the Elastic Beanstalk infrastructure, which in turn increased the rollback time, since that capacity would have to be scaled back up before it could take traffic again.
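For a feel of how weighted DNS phases traffic, here is a small Python sketch. It only models the arithmetic: with weighted records, each target receives roughly its weight divided by the sum of all weights. The record names and the percentage steps are illustrative assumptions, not our real configuration, and the functions are hypothetical helpers rather than any DNS provider’s API.

```python
from typing import Dict


def phase_weights(percent_to_kubernetes: int) -> Dict[str, int]:
    """Return illustrative record weights for one migration step."""
    if not 0 <= percent_to_kubernetes <= 100:
        raise ValueError("percent_to_kubernetes must be between 0 and 100")
    return {
        "elastic-beanstalk": 100 - percent_to_kubernetes,
        "kubernetes": percent_to_kubernetes,
    }


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers per target: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical phasing steps; rolling back is just returning to 0.
    for percent in (0, 10, 50, 100):
        print(percent, traffic_share(phase_weights(percent)))
```

The low TTL matters because resolvers may keep handing out a cached answer until it expires, so a weight change (or a rollback) only takes full effect after roughly one TTL has passed.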

Document everything

Well, not everything, but we documented a lot. One area of documentation was the weekly training sessions we did. These focused on different areas of the stack. We went deep into the components of Kubernetes, what they did and how they fail. We stepped through the network stack, following the thread from the browser to the load-balancer, to Kubernetes hosts, through a litany of iptables and into the service process. We followed a similar flow from one service to another.

Failure Wednesdays

I believe the classic term is “Failure Fridays”, but we like Wednesdays. This is a time when we block out several hours and break stuff. This started with Luke helping us get up to speed and ensuring we were paying attention in class. He’d break something in the stack and exclaim “Oh no! Your app is down! What could it be?”. Was a Kubernetes service not working correctly? Something screwed up in iptables? Who deleted our load balancer?

This was both fun and hard work. It got us frantically looking back at our own documentation and getting more familiar with the broader Kubernetes documentation. We would get through three or more of these “outages” in a session.

After a while it’s hard to come up with new ways to break things so that we can exclaim, “Oh no! Your app is down! What could it be?” Instead, we have evolved this to seeing how things respond as we change them. For instance, we may use this as an opportunity to upgrade a non-critical Kubernetes cluster, roll the machines, restore things from backup (Heptio Ark, ftw!), or fail-over RDS databases. Each time, we’re looking for gaps in our knowledge on how the system behaves. An important aspect to this is the dedicated time we take for the team to lock ourselves in a room with a whiteboard, big TV to screen-share and sometimes… pastries!

One week we decided to see how fast we could build a new cluster from scratch. We found a few steps we hadn’t documented and had no code for. Some head scratching moments. Did we handcraft that part? Regardless, we had found another gap and filled it.

We’re still no experts

We’re definitely far from expert level at the game of Kubernetes. That said, we feel confident enough in our setup that we currently struggle to see how moving to a hosted solution will benefit us right now. It may actually constrain our learning and control of the platform. We continue to strive to keep it as simple as possible, slowly layering on more monitoring and metrics to increase our understanding.

I think about “Kubris” a lot.