We wanted to keep scaling our teams while maintaining the principles that helped us move fast: autonomy, working with minimal coordination, and self-service infrastructure.

Kubernetes helps us achieve this in two ways: it provides application-focused abstractions, and we operate and configure our clusters to minimise coordination.

Application-focused abstractions

At the core of Kubernetes are concepts that map closely to the language used by an application developer. For example, you manage versions of your application as a Deployment. You can run multiple replicas behind a Service and expose them over HTTP via an Ingress. And, through Custom Resources, it's possible to extend and specialise this language to your own needs.
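To make that vocabulary concrete, here is a minimal sketch of the three core objects for a web application, built as plain Python dicts that could be serialised to JSON or YAML for `kubectl apply`. The application name, image, port and host are hypothetical placeholders, and this is just one way to express these objects, not how uSwitch generates its manifests.

```python
import json


def deployment(name: str, image: str, replicas: int) -> dict:
    """Deployment: manages versions (rollouts) of the application's pods."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": 8080}],  # hypothetical port
                }]},
            },
        },
    }


def service(name: str) -> dict:
    """Service: a stable address in front of the Deployment's replicas."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": {"app": name},
                 "ports": [{"port": 80, "targetPort": 8080}]},
    }


def ingress(name: str, host: str) -> dict:
    """Ingress: routes HTTP requests for a host to the Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {"rules": [{
            "host": host,
            "http": {"paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {"service": {"name": name, "port": {"number": 80}}},
            }]},
        }]},
    }


if __name__ == "__main__":
    app = "example-web"  # hypothetical application name
    for obj in (deployment(app, "example/web:1.0", replicas=3),
                service(app),
                ingress(app, "web.example.com")):
        print(json.dumps(obj, indent=2))
```

Note how the label `app: example-web` is the only thing tying the three objects together: the Service selects the Deployment's pods by it, and the Ingress refers to the Service by name.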

These abstractions help application teams be more productive. For example, the ones I've described above are pretty much all you need to deploy and run a web application; Kubernetes automates the rest.

In the iceberg picture I showed earlier, these core concepts sit at the waterline, connecting what an application developer is trying to achieve with the platform underneath. Our cluster operations team can make many of the lower-level, lower-value decisions (like managing metrics, logging etc.) while still sharing a conceptual language with the application teams above.

In 2010 uSwitch had a traditional operations team responsible for running the monolith and, more recently, an IT team partly responsible for managing our AWS account. I believe one of the things that limited that team's success was the lack of a shared conceptual language.

When your language only includes concepts like EC2 instances, load-balancers and subnets, it's hard to communicate much meaning. It made it difficult, if not impossible, to describe what an application was: sometimes it was a Debian package, sometimes something deployed with Capistrano, and so on. There was no way to describe an application in a language shared across teams.

In the early 2000s I worked at ThoughtWorks in London. During my interviews someone recommended Eric Evans' Domain-Driven Design. I bought a copy from Foyles on my way home, started reading it on the train and have referenced it on most projects and systems I've worked on ever since.

One of the key concepts presented in the book is Ubiquitous Language: the careful extraction of a common vocabulary to aid communication amongst people and teams. I believe that one of Kubernetes' greatest strengths is providing a ubiquitous language that connects application teams and infrastructure teams. And, because it's extensible, this language can grow beyond the core concepts to include more domain- and business-specific ones.
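As an illustration of that extensibility, the sketch below defines a new, domain-specific kind with a CustomResourceDefinition and then creates an instance of it. The API group `example.com`, the kind `PriceFeed` and its fields are entirely hypothetical; the point is only that, once the CRD is applied, teams can talk about a `PriceFeed` in the same way they talk about a Deployment.

```python
def crd(group: str, kind: str, plural: str, spec_properties: dict) -> dict:
    """A CustomResourceDefinition that adds `kind` to the cluster's API."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # Kubernetes requires the CRD name to be "<plural>.<group>".
        "metadata": {"name": f"{plural}.{group}"},
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"plural": plural, "singular": kind.lower(), "kind": kind},
            "versions": [{
                "name": "v1",
                "served": True,
                "storage": True,
                "schema": {"openAPIV3Schema": {
                    "type": "object",
                    "properties": {"spec": {"type": "object",
                                            "properties": spec_properties}},
                }},
            }],
        },
    }


def custom_resource(group: str, kind: str, name: str, spec: dict) -> dict:
    """An instance of the new kind, managed like any built-in object."""
    return {"apiVersion": f"{group}/v1", "kind": kind,
            "metadata": {"name": name}, "spec": spec}


# Hypothetical domain concept, for illustration only.
definition = crd("example.com", "PriceFeed", "pricefeeds",
                 {"source": {"type": "string"},
                  "refreshSeconds": {"type": "integer"}})
feed = custom_resource("example.com", "PriceFeed", "energy-prices",
                       {"source": "https://prices.example.com",
                        "refreshSeconds": 300})
```

In practice a controller would watch `PriceFeed` objects and act on them; that part is out of scope for this sketch.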

Shared language helps us communicate more effectively when we need to, but we still want to ensure teams can operate with minimal coordination.

Minimise Necessary Coordination

In Accelerate, the authors highlight the characteristics of a loosely coupled architecture that drive IT performance:

the biggest contributor to continuous delivery in the 2017 analysis… is whether teams can:

Make large-scale changes to the design of their system without the permission of somebody outside the team

Make large-scale changes to the design of their system without depending on other teams to make changes in their systems or creating significant work for other teams

Complete their work without communicating and coordinating with people outside their team

Deploy and release their product or service on demand, regardless of other services it depends upon

Do most of their testing on demand, without requiring an integrated test environment

We wanted to run centralised, soft multi-tenant clusters that all teams could build upon, but we also wanted to retain many of the characteristics described above. Coordination can't be avoided entirely, but we operate Kubernetes as follows to minimise it:

We run multiple production clusters, and teams are able to choose which clusters to run their applications in. We don't use Federation yet (we're waiting on AWS support); instead, we use Envoy to load-balance across each cluster's Ingress load-balancers. Much of this is automated through our Continuous Delivery pipeline (we use Drone) and other AWS services.
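One way to picture the Envoy layer is as a single upstream cluster whose endpoints are the Ingress load-balancers of each Kubernetes cluster. The sketch below builds such an Envoy v3 cluster definition as a Python dict. The hostnames, port and weights are hypothetical, and this is an assumed shape for illustration, not uSwitch's actual Envoy configuration.

```python
def envoy_upstream(name: str, ingress_lbs: dict[str, int], port: int = 80) -> dict:
    """Build an Envoy v3 cluster spreading traffic over several Ingress LBs.

    `ingress_lbs` maps each Kubernetes cluster's Ingress load-balancer
    hostname (hypothetical here) to a relative load-balancing weight.
    """
    for host, weight in ingress_lbs.items():
        # Envoy rejects endpoint weights below 1; remove an endpoint to drain it.
        if weight < 1:
            raise ValueError(f"weight for {host} must be >= 1, got {weight}")
    return {
        "name": name,
        "type": "STRICT_DNS",        # keep re-resolving the LB hostnames
        "connect_timeout": "1s",
        "lb_policy": "ROUND_ROBIN",  # weighted when endpoint weights differ
        "load_assignment": {
            "cluster_name": name,
            "endpoints": [{
                "lb_endpoints": [
                    {
                        "endpoint": {"address": {"socket_address": {
                            "address": host, "port_value": port}}},
                        "load_balancing_weight": weight,
                    }
                    for host, weight in sorted(ingress_lbs.items())
                ],
            }],
        },
    }


upstream = envoy_upstream("example-web", {
    "ingress.cluster-a.example.internal": 1,  # hypothetical hostnames
    "ingress.cluster-b.example.internal": 1,
})
```

Because the endpoints are just load-balancer hostnames, adding or removing a Kubernetes cluster means changing this list, which is the kind of step a CD pipeline can automate.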

All clusters are configured with the same Namespaces. These map approximately 1:1 to teams.
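Keeping Namespaces identical across clusters is what lets a team move or spread an application between clusters without asking anyone. A minimal sketch of a drift check follows, assuming the namespace list for each cluster has already been collected (for example with `kubectl get namespaces` against each context). The cluster and team names are hypothetical.

```python
def namespace_drift(clusters: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per cluster, the namespaces that exist elsewhere but not there."""
    expected = set().union(*clusters.values())
    return {
        name: expected - present
        for name, present in clusters.items()
        if expected - present  # only report clusters that are missing something
    }


# Hypothetical clusters and team namespaces.
drift = namespace_drift({
    "prod-a": {"team-x", "team-y"},
    "prod-b": {"team-x"},
})
```

Here `drift` would report that `prod-b` is missing `team-y`; an empty result means every cluster offers every team the same Namespaces.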

We use RBAC to control access to Namespaces. All access is authenticated and authorised against our corporate identity in Active Directory.
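A sketch of what per-team access might look like: a RoleBinding in the team's Namespace that grants a directory group Kubernetes' built-in `edit` ClusterRole. Binding a ClusterRole through a RoleBinding limits the permissions to that one Namespace. The namespace and group names are hypothetical, and it is an assumption that Active Directory groups reach the API server as Kubernetes group names (for example via an OIDC groups claim).

```python
def team_role_binding(namespace: str, ad_group: str,
                      cluster_role: str = "edit") -> dict:
    """Grant a directory group `cluster_role` within a single Namespace."""
    rbac = "rbac.authorization.k8s.io"
    return {
        "apiVersion": f"{rbac}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{namespace}-{cluster_role}",
                     "namespace": namespace},
        # The group name must match what the authenticator reports.
        "subjects": [{"kind": "Group", "apiGroup": rbac, "name": ad_group}],
        "roleRef": {"apiGroup": rbac, "kind": "ClusterRole",
                    "name": cluster_role},
    }


binding = team_role_binding("team-x", "k8s-team-x")  # hypothetical names
```

Since every cluster has the same Namespaces, the same bindings can be applied to every cluster.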

Clusters are auto-scaled, and we do as much as we can to optimise node start-up time. It still takes a couple of minutes, but it means that, in general, no coordination is needed even when teams need to run large workloads.
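To see why large workloads don't need a conversation first, a back-of-the-envelope estimate helps. The function below computes a lower bound on the extra nodes a batch of pending pods needs, from their total CPU and memory requests and a node's allocatable capacity. This is a simplification of our own for illustration: the Cluster Autoscaler itself simulates scheduling pods onto node-group templates rather than doing this arithmetic, and the node sizes used here are hypothetical.

```python
import math


def extra_nodes_needed(pending_cpu_m: int, pending_mem_mi: int,
                       node_cpu_m: int, node_mem_mi: int) -> int:
    """Lower bound on extra nodes for pending pods.

    CPU is in millicores and memory in MiB. Bin-packing fragmentation is
    ignored, so the real number of nodes can be higher.
    """
    by_cpu = math.ceil(pending_cpu_m / node_cpu_m)
    by_mem = math.ceil(pending_mem_mi / node_mem_mi)
    return max(0, by_cpu, by_mem)


# 10 pods each requesting 500m CPU and 1024Mi memory, on hypothetical nodes
# with 3920m CPU and 15000Mi memory allocatable: CPU is the constraint.
nodes = extra_nodes_needed(10 * 500, 10 * 1024, 3920, 15000)
```

Here `nodes` is 2: CPU needs ceil(5000 / 3920) = 2 nodes while memory needs only 1.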

Applications auto-scale on application-level metrics exported from Prometheus. Application teams can export queries per second, operations per second and so on, and manage their application's autoscaling in response to that metric. And, because we use the Cluster Autoscaler, nodes are provisioned when demand exceeds our current cluster capacity.
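The arithmetic the Horizontal Pod Autoscaler applies to such a metric is documented by Kubernetes: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces just that core calculation, clamped to min/max replicas; it leaves out the real controller's stabilisation windows and handling of missing or unready pods. The replica counts and per-pod queries-per-second values are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Core HPA scaling rule for a per-pod average metric (e.g. QPS per pod)."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas  # close enough to target: don't scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 150 QPS each against a target of 100 QPS per pod.
scaled = desired_replicas(4, 150, 100)
```

That scales to 6 replicas. If those 6 pods don't fit on the existing nodes, some stay pending, which is the signal the Cluster Autoscaler reacts to by adding nodes.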

We wrote a Go command-line tool called u that standardises how teams authenticate to Kubernetes and Vault, request temporary AWS credentials, and more.
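The post doesn't describe how u hands credentials to Kubernetes. One standard mechanism a tool like it could use is a client-go credential plugin: `kubectl` runs the command configured under `users[].user.exec` in a kubeconfig and reads an `ExecCredential` object from its standard output. The sketch below produces that JSON for a token that has already been obtained. How the token is obtained (for example an OIDC flow against the corporate directory) is assumed and not shown, and this is not a description of u's actual implementation.

```python
import json
from datetime import datetime, timedelta, timezone


def exec_credential(token: str, lifetime: timedelta) -> str:
    """Serialise a bearer token as a client-go ExecCredential.

    kubectl caches the token until expirationTimestamp and then re-runs
    the plugin. The apiVersion must match the one set in the kubeconfig.
    """
    expiry = datetime.now(timezone.utc) + lifetime
    return json.dumps({
        "apiVersion": "client.authentication.k8s.io/v1",
        "kind": "ExecCredential",
        "status": {
            "token": token,
            "expirationTimestamp": expiry.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    })


# A real plugin would print this to stdout after the user signs in;
# "example-token" is a placeholder.
payload = exec_credential("example-token", timedelta(hours=1))
```

Keeping the sign-in logic in one tool and exposing it through this standard hook means every team authenticates the same way, without per-team scripts.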

Authenticating to Kubernetes using the u command-line tool

I'm not arguing that Kubernetes has increased our autonomy (although that may be the case), but it has certainly helped us maintain high levels of self-service and autonomy while reducing some of the pain we felt.