Build it, containerise it, ship it

By moving to Kubernetes we cut the number of running instances by about 50% and reduced our monthly EC2 costs by up to 40% compared to the old CodeDeploy-based setup

When I started at my current employer, ShopGun, my mission was to containerise the existing AWS platform and move everything into Kubernetes

Sounds pretty simple when you put it like that doesn’t it? 😃

After about nine months of intensive work, the last software project was moved into our production Kubernetes cluster, marking a milestone for the devops team

Not only had we substantially lowered our EC2 running costs, we did it while acquiring our biggest competitor in Norway, Mattilbud, taking their whole infrastructure under our wing and considerably growing the platform's user base

Backstory

When I started, most of the platform was provisioned with Terraform or CloudFormation code that created an instance per project and environment: in total about 70 instances and a bunch of ELBs, roughly a third of them belonging to the staging environment

There was a Jenkins instance building the code, and we mostly used CodeDeploy to get the projects out to the instances

It was a solid, but not very cost-efficient, setup

So what I got to work with was about 20 software projects in various shapes, sizes and forms, each running on its own EC2 instance per environment.

Most of them written in Erlang, Python, Nodejs, PHP or Go

All of them had to go into containers

A new pipeline is born

With a hello-world project written in Erlang at hand, I set out on my quest one late Friday evening. It ended sometime on Saturday morning; the foundation had been laid and I now had everything I needed to proceed

One of the early obstacles to overcome was how we would build our software in the new stack

Jenkins being Jenkins, it would occasionally give up and need a kick. Some of the projects required rather odd and clunky deploy scripts to work as desired on AWS. We wanted something easier to interface with, more transparent and, most of all, more streamlined

After some tests and looking around, I decided to go with CodePipeline, CodeBuild and Elastic Container Registry for building our Docker images

I ended up writing my own lambda function in Python that triggers as a step in the CodePipeline for each project to deploy the code to the different Kubernetes clusters

Manifests are stored as Jinja2-templated Kubernetes YAML resources in the Github repo

Runtime configuration for pods is mostly done with environment variables, which are stored as secrets in Kubernetes and consumed on pod creation

Conceptually, this is roughly how it looks:

The developer commits code to Github

CodePipeline fetches the code and sends it to CodeBuild

CodeBuild builds the Docker image from the Dockerfile and pushes it to ECR

CodePipeline triggers the deployment of the built Docker image

The deployer creates secrets in k8s from secure storage

The deployer renders the YAML files provided by the Github repo and creates any defined resources such as Deployments, StatefulSets, Services, ingress rules, etc
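The deploy steps above can be sketched roughly like this: a minimal Python stand-in for the real Lambda deployer, assuming Jinja2 is installed and kubectl is on the PATH. The template, project names, image URL and git ref are made up for illustration.

```python
import subprocess
from jinja2 import Template  # manifests are stored as Jinja2-templated YAML

# Hypothetical manifest template; the real ones live in each project's repo
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ project }}
spec:
  replicas: {{ replicas }}
  template:
    spec:
      containers:
      - name: {{ project }}
        image: {{ ecr_repo }}:{{ git_ref }}
"""

def render(template, context):
    """Render one manifest with the build's context (image tag, replicas, ...)."""
    return Template(template).render(**context)

def deploy(manifest_yaml):
    """Apply the rendered manifest to the target cluster via kubectl."""
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=manifest_yaml.encode(), check=True)

rendered = render(MANIFEST, {
    "project": "hello-world",
    "replicas": 2,
    "ecr_repo": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/hello-world",
    "git_ref": "ab12cd3",  # short commit ref from the build
})
print(rendered)
```

The real deployer also creates the secrets before applying the manifests; that part is omitted here.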

Each of the steps above posts feedback in our engineering channel, along with a small one-liner the developer can run to tail the build output in real time in their terminal from anywhere in the world

Once the pipeline was running and we had some Docker images building, the project could continue

Setting up a Kubernetes cluster

Provisioning Kubernetes clusters can be quite the pain when done manually; there are a lot of buttons and knobs to turn and a lot that can go wrong

You also need a fully operational etcd cluster in the backend, and most cluster installers say little to nothing about how to operate it or recover it if something goes south

I’ve provisioned Kubernetes clusters on bare metal before and have some examples here on how it can be done with CoreOS (warning: the content is rather old now and not maintained)

Now working entirely in the cloud, I needed to find a better way to do it, both for my own sake and for anyone else who would work with the stack

In the beginning a bunch of tools and methods were considered

Ultimately it landed on kops, as it felt like a well-maintained project at the time and my tests with it yielded good results. It also produced Terraform output, which could easily be checked in to Github
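As a sketch of that workflow (cluster name, state store and zones are placeholders), a kops invocation along these lines creates the cluster definition and emits Terraform code instead of touching AWS directly:

```shell
# Sketch only: names are hypothetical; --target=terraform writes Terraform
# files to the current directory instead of applying changes itself
kops create cluster \
  --name=k8s.example.com \
  --state=s3://example-kops-state \
  --zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --networking=kube-router \
  --target=terraform --out=.

# Review the generated Terraform, check it in to Github, then apply
terraform plan && terraform apply
```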

For the network CNI, kube-router was used, as I became one of its maintainers some time ago after writing most of its metrics code. I also consider dogfooding a virtue.

The pipeline strikes back

Pulling the same things over and over for each build quickly becomes tiresome and a waste of time. Build a set of base images for building and for runtime, and use them in multi-stage Docker builds to speed things up

Be mindful of your images. The larger the image, the longer it takes to deploy. If you are going to write Dockerfiles, be sure to fully understand how they work before adding layers without regard for the end result; otherwise your tiny microservice might end up the size of a fully installed operating system.

Add the git commit ref to your Docker image tags; it makes it super easy to find exactly what code an image was built from
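Both points can be sketched together: a minimal multi-stage Dockerfile (the Go service, image names and versions are hypothetical) plus tagging with the short commit ref.

```dockerfile
# Hypothetical Go service: build in a full toolchain image, ship only the binary
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# The runtime stage starts from a tiny base, keeping the final image small
FROM alpine:3.19
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

```shell
# Registry URL is hypothetical; assumes you are logged in to ECR
REF=$(git rev-parse --short HEAD)
docker build -t 123456789012.dkr.ecr.eu-west-1.amazonaws.com/hello-world:$REF .
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/hello-world:$REF
```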

Mirror the public images you use into your ECR account, or run a local image cache. Docker Hub has been down several times historically, and you don’t want a cluster outage on your hands because new nodes can’t download the public images you depend on
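Mirroring is nothing fancy, a pull, retag and push (account, region and repository names below are made up):

```shell
# Assumes the ECR repository already exists and you are logged in to ECR
docker pull nginx:1.25
docker tag nginx:1.25 123456789012.dkr.ecr.eu-west-1.amazonaws.com/mirror/nginx:1.25
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/mirror/nginx:1.25
```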

Return of the reverse-proxy (ingress)

One of the common problems with Kubernetes clusters is how to get traffic into the cluster from the outside world. In the cloud this is pretty convenient compared to bare metal, but it’s still not a simple task once you break down what goes on behind the scenes

In our setup, traffic is terminated at the outer edge by a set of ALBs that also handle HTTPS termination.

These in turn have Skipper as their backend. Skipper listens on a specific port on a set of nodes in our cluster and acts as a reverse proxy, routing traffic further into the cluster and providing things such as shadow traffic, rate limiting, circuit breakers, header rewrites and a bunch of other goodies

I tested multiple solutions, including (in no particular order) the Nginx ingress controller, the alb-ingress-controller and Skipper.

The Nginx controller served me well in my bare-metal days, but it had some flaws: it required me to set up all the networking in front of it, and the controller boils down to writing Nginx config files and reloading them on changes

That is not ideal in a landscape where changes happen at a more or less constant rate; I felt we needed something designed for this purpose

I had some success with the alb-ingress-controller from CoreOS/Ticketmaster, which automatically provisions ALBs for ingress resources.

Skipper from Zalando worked very well in my tests and, paired with the kube-ingress-aws-controller, provides seamless access to my services via ingress from the outside. The setup also automatically maps HTTPS certificates from AWS ACM to the newly provisioned ALBs, if set up properly.
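From the service's point of view this is just an ordinary Ingress; Skipper behaviour is attached via annotations. A hedged sketch (hostname, service and filter values are hypothetical; `zalando.org/skipper-filter` is the annotation Skipper reads):

```yaml
# The skipper-filter annotation applies the listed Skipper filters
# (a rate limit here) to every route this ingress generates
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: hello-world
  annotations:
    zalando.org/skipper-filter: ratelimit(20, "1m")
spec:
  rules:
  - host: hello.example.com
    http:
      paths:
      - backend:
          serviceName: hello-world
          servicePort: 80
```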

Skipper is an HTTP router and reverse proxy for service composition. It’s designed to handle >300k HTTP route definitions with detailed lookup conditions, and flexible augmentation of the request flow with filters. It can be used out of the box or extended with custom lookup, filter logic and configuration sources.

It has a very rich feature set, and the developers are always fast to respond if you discover any bugs 🙏 (shout-out to sszuecs for being there when I’ve had questions)

The storage menace

Storage in AWS is simple and reliable, and there are plenty of options with EBS and EFS. If those do not fit your needs, there is support for a variety of filesystems, GlusterFS and Ceph among them

EFS is NFS-based. I’m not going to go on a long rant about why, but I generally tend to avoid this kind of storage unless the specific use case works with it

EBS is block storage attached to the local instance. Perfect in most cases for anything needing persistent storage, but it comes with a drawback: EBS volumes are bound to the AZ they are created in

A typical highly available Kubernetes cluster on AWS spans multiple Availability Zones in a region so that it can survive one of them going down

If you have monoliths that require persistent storage and can work on an NFS filesystem, EFS is perfect and lets the application run in any AZ in your Kubernetes cluster.

efs-provisioner is a service you can run inside your Kubernetes cluster to allow provisioning and use of an EFS-backed filesystem in your pods
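Once the provisioner is running, consuming EFS is a regular claim. A minimal sketch, assuming the provisioner was deployed with its example storage class name `aws-efs`:

```yaml
# ReadWriteMany is the point of EFS: many pods, across AZs, one filesystem
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: shared-data
spec:
  storageClassName: aws-efs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
```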

Be wary of the IO credits for an EFS filesystem; by default you only have a specific budget to spend, and depending on filesystem usage you might get throttled

For EBS volumes to work reliably in Kubernetes, I recommend splitting your compute nodes into Instance Groups bound to specific AZs from the beginning and always keeping a minimum of one node in each AZ.

If you have IGs/ASGs spanning multiple AZs, the risk is that when a new node spins up to replace an old one that went bad, it might come up in the wrong AZ, and now your pod can’t start because its persistent storage isn’t available
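With kops this means one InstanceGroup per zone. A hypothetical group pinned to a single AZ (names, sizes and zones are examples) looks like this, with a sibling group per remaining zone:

```yaml
# Pinning subnets to one AZ means a replacement node always comes back
# in the same zone as the EBS volumes its pods depend on
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-eu-west-1a
  labels:
    kops.k8s.io/cluster: k8s.example.com
spec:
  role: Node
  machineType: m5.large
  minSize: 1
  maxSize: 3
  subnets:
  - eu-west-1a
```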

Attack of the IAM credentials

One problem that emerged later was IAM credentials. In the old stack every instance had its own set of permissions, but with Kubernetes several pieces of software with different permission requirements now had to run on the same node

Assigning a broad set of permissions to the compute nodes didn’t feel like an optimal solution. After digging around a bit I found kube2iam

I tested kube2iam, but it had several race conditions where IAM credentials would not be available on pod start, forcing me to use weird init scripts with long sleeps in the containers. Not very kosher in the Kubernetes pod lifecycle

I found KIAM, and coupled with some environment variables I inject during deployment, it has been working out well so far
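With KIAM, the role a pod should assume is declared with the `iam.amazonaws.com/role` annotation; KIAM's agent intercepts the pod's calls to the EC2 metadata API and hands out temporary credentials for that role. A minimal sketch (role name and image are hypothetical):

```yaml
# The pod's namespace must also allow the role via its
# iam.amazonaws.com/permitted annotation
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
  annotations:
    iam.amazonaws.com/role: hello-world-app-role
spec:
  containers:
  - name: hello-world
    image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/hello-world:ab12cd3
```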

I even ended up making a Grafana dashboard for the project

Revenge of the api access

The whole infrastructure is set up in a private VPC, and accessing the Kubernetes API via SSH port forwarding over a jump host was rather clunky, requiring both maintenance and credentials

kubectl is a godsend, and I wanted our developers to be able to fully enjoy it when interacting with their services in the clusters

To get around the jump-host SSH-tunnel history, I created a lambda function whose main job is to be invoked by developers to whitelist their IP in a security group attached to a load balancer, letting them talk to the respective clusters’ API servers. It also runs as a CloudWatch-triggered cronjob to perform cleanup
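The shape of that function is roughly the following. This is a minimal sketch with a hypothetical security-group id and event format, not the production code; the boto3 call is the standard EC2 API for adding an ingress rule.

```python
import datetime

def build_ingress_rule(ip, description):
    """Build the security-group ingress rule for one developer's IP."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": f"{ip}/32",
            # Stash a date in the description so the cleanup cron
            # can expire stale entries later
            "Description": f"{description} {datetime.datetime.utcnow():%Y-%m-%d}",
        }],
    }

def handler(event, context):
    """Hypothetical Lambda entrypoint: whitelist the caller's IP on the SG
    fronting the cluster API load balancer."""
    import boto3  # imported lazily so the helper above stays testable offline
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical security-group id
        IpPermissions=[build_ingress_rule(event["ip"], event["user"])],
    )
    return {"whitelisted": event["ip"]}

print(build_ingress_rule("203.0.113.7", "alice"))
```

The cleanup path simply lists the rules, parses the dates out of the descriptions, and revokes anything older than the cutoff.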

The developers awaken

With both a pipeline and cluster running the developers started to catch on and needed to interface with the new world order

In the beginning, tokens were handed out, but that required a secret config file per user and clunky reverse SSH tunnels secured with private SSH keys.

With the lambda function this was now down to just a token per user

The last access token

aws-iam-authenticator lets your users authenticate against your cluster using only their AWS credentials

This was the missing puzzle piece. Now all our developers need to access our Kubernetes clusters is:

the aws cli tool (pip install)

a non-secret kubectl config file (available in the internal documentation)

a bash login script invoking the whitelisting lambda (part of the toolkit everyone is expected to download)

the authenticator software (one-line installer)

AWS credentials (given to every developer so they can work with the system)

the kubectl binary
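Tying the last pieces together, the non-secret kubeconfig looks roughly like this (cluster name and endpoint are placeholders; the exec API version is the one current at the time of writing). kubectl shells out to the authenticator, which mints a token from the user's AWS credentials:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: production
  cluster:
    server: https://api.k8s.example.com
users:
- name: production
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: aws-iam-authenticator
      args: ["token", "-i", "production"]
contexts:
- name: production
  context:
    cluster: production
    user: production
```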

With this they can now debug and monitor their projects just as if they had direct access to the instances in the old setup

Software builds, tests, deploys and runs integration tests from nothing but a git commit

Our Erlang developers can, with a few keystrokes, securely get a live remote console directly into their application from anywhere in the world

Some gotchas

Be sure to read up on how Dockerfiles and layers work. You can save hundreds of megabytes, and in some cases gigabytes, with properly structured images

Plan for DNS! This is the lost child of Kubernetes that has been flying under the radar for quite a while

Kubernetes’ official way of handling this seems to be nodelocaldns: https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/README.md

Zalando also has an example of how they handle it

Be aware of how memory and CPU work inside containers. Docker uses cgroups to limit container resources. This means your container might see the whole node’s available RAM while being limited to 200 MB, and it won’t see the KILL coming once it exceeds the 200 MB it has been assigned
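You can see the actual limit from inside a container by reading it straight out of the cgroup filesystem. A small Python sketch that checks both the cgroup v1 and v2 paths:

```python
import os

def cgroup_memory_limit():
    """Return the container's cgroup memory limit in bytes, or None if
    there is no limit (or we are not in a limited cgroup at all)."""
    # cgroup v1 and v2 expose the limit at different paths
    for path in ("/sys/fs/cgroup/memory/memory.limit_in_bytes",  # v1
                 "/sys/fs/cgroup/memory.max"):                   # v2
        try:
            with open(path) as f:
                raw = f.read().strip()
        except OSError:
            continue
        if raw == "max":  # cgroup v2: no limit configured
            return None
        limit = int(raw)
        # cgroup v1 reports an enormous number when no limit is set
        if limit >= 2**62:
            return None
        return limit
    return None

print(cgroup_memory_limit())
```

Compare that number with what `os.sysconf` or `/proc/meminfo` report for total RAM and you will see the mismatch the quote above warns about.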

Some good reading about containers and Java memory handling can be found in Java inside docker: What you must know to not FAIL

Make a habit of using a lightweight init system in your containers, such as dumb-init

Several times I’ve run into PID starvation on nodes because containers lacked a zombie-reaping PID 1 and something in a loop inside the container was dropping children like it’s in a Rob Zombie movie
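Wiring dumb-init in as PID 1 is a two-line change to the Dockerfile (base image and app path below are examples):

```dockerfile
# dumb-init runs as PID 1, forwards signals, and reaps zombie children
FROM alpine:3.19
RUN apk add --no-cache dumb-init
COPY app /app
ENTRYPOINT ["dumb-init", "--"]
CMD ["/app"]
```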

Ship artifacts not build environments!

(Famous) Last words

There are some things I haven’t mentioned in this article. The monitoring setup alone would fill an equally long write-up on how everything ties together (the prometheus-operator together with Grafana has been serving us well)

Lastly, if you want to chat with other Kubernauts or find most of the Special Interest Groups (SIGs), be sure to sign up for the official Kubernetes Slack @ http://slack.k8s.io/

Thanks for reading and I hope you found my writeup interesting!

Dictionary

AZ — Availability Zone

ALB — Application Load Balancer

ASG — Auto Scaling Group

EBS — Elastic Block Store

ECR — Elastic Container Registry

EFS — Elastic File System

ELB — Elastic Load Balancing

IG — Instance Group