Deploying faster and more often

The new trunk-based flow allowed us to deliver features to master one by one. Because of this, we could find broken code quickly and revert to a working build easily. However, we still had long deploy (90 min) and rollback (45 min) times, which limited us to 4–5 deploys per day.

We also faced challenges running our SOA on Elastic Beanstalk. The most obvious solution was to move to containers with some form of container orchestration; we already used Docker and docker-compose for local development.

Our next step was to research popular container orchestrators:

- AWS ECS
- Docker Swarm
- Apache Mesos
- Nomad
- Kubernetes

We decided to use Kubernetes. Every other container orchestrator had drawbacks: ECS means vendor lock-in. Swarm is losing ground to Kubernetes. Apache Mesos, with its ZooKeepers, is like a spaceship. Nomad sounded interesting, but it's inefficient to use without an infrastructure based on HashiCorp products, and there are no namespaces in Nomad's free version.

Despite the steep learning curve, Kubernetes is the de facto standard in container orchestration. It can be used as a service on every large cloud provider. It’s in active development with a huge community and strong documentation.

We expected to complete our migration to Kubernetes in 1 year. Two platform engineers without any Kubernetes experience worked half-time on the migration.

Starting to use Kubernetes

We started with a Kubernetes proof of concept: we created a testing cluster and documented all of our work. We decided to use kops, since Amazon's EKS support only became available in Europe in September 2018.

We tested many things, including cluster-autoscaler, cert-manager, Prometheus, HashiCorp Vault, and Jenkins integration. We also experimented with rolling-update strategies for the self-hosted cluster while updating our test cluster, and ran into DNS issues and a few AWS-related network issues during cluster troubleshooting.

For cost optimization, we used Spot Instances. To be notified of Spot interruptions, we used kube-spot-termination-notice-handler, and we found that the Spot Instance advisor is useful for checking the frequency of Spot Instance interruption.
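As an illustration, kops runs nodes on Spot capacity when an instance group sets maxPrice. The group name, machine type, price, and subnets below are hypothetical, not our actual configuration:

```yaml
# Hypothetical kops instance group using Spot capacity.
# Setting maxPrice makes kops request Spot Instances at or below that bid.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-spot
spec:
  role: Node
  machineType: m5.large
  maxPrice: "0.05"   # bid ceiling in USD/hour; omit this field for On-Demand
  minSize: 2
  maxSize: 10
  subnets:
    - eu-west-1a
    - eu-west-1b
```

Pairing a group like this with kube-spot-termination-notice-handler lets pods drain gracefully when AWS reclaims the instance.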

We completed the migration from the Skullcandy flow to trunk-based development, running a separate stage environment in Kubernetes for every pull request. This reduced feature delivery to production from 4–6 hours to 1–2 hours.

A GitHub hook triggers the stage environment creation

We used a testing cluster for these dynamic environments, and every dynamic environment was in a separate namespace. Developers had access to the Kubernetes dashboard for debugging.
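The per-PR flow can be sketched roughly as follows. The PR number, release name, and chart path are hypothetical, and the cluster commands only run when a cluster is actually reachable:

```shell
# Rough sketch of creating an isolated stage environment per pull request.
# PR_NUMBER would normally come from the CI job; 1234 is a placeholder.
set -eu

PR_NUMBER="${PR_NUMBER:-1234}"
NAMESPACE="pr-${PR_NUMBER}"
echo "deploying stage to namespace: ${NAMESPACE}"

# Only touch the cluster when kubectl exists and a cluster is reachable.
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
  helm upgrade --install "app-${PR_NUMBER}" ./helm/microservice1 \
    --namespace "${NAMESPACE}" \
    -f ./helm/microservice1/values.stage.yaml
fi
```

Tearing the environment down is then just deleting the namespace when the PR closes.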

We started to get value from the testing cluster 1–2 months after launching our proof of concept, a result we’re proud of!

Staging and production clusters

Here is the setup of our staging and production clusters:

- kops and Kubernetes 1.11 (the latest version of kops available at the time of setup)
- 3 master nodes in different availability zones
- Private network topology with a dedicated bastion host, Calico CNI
- Prometheus in the same cluster for metrics, with a PVC (we don't need long-term storage for our metrics)
- Datadog agent for APM
- Dex + dex-k8s-authenticator to provide developers with access to the staging cluster
- Staging nodes run on Spot Instances
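With the Dex setup, each developer ends up with a kubeconfig user entry that authenticates via OIDC, roughly along these lines (the issuer URL, client ID, and token values here are placeholders, not our real configuration):

```yaml
# Sketch of a kubeconfig user entry for Dex-based OIDC login,
# as generated by a tool like dex-k8s-authenticator.
users:
  - name: developer@example.com
    user:
      auth-provider:
        name: oidc
        config:
          idp-issuer-url: https://dex.example.com
          client-id: kubernetes
          client-secret: <redacted>
          id-token: <issued-by-dex>
          refresh-token: <issued-by-dex>
```

The API server validates the id-token against Dex, so developer access can be revoked centrally.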

While operating the clusters, we ran into some problems. For example, the Nginx Ingress and Datadog agent versions differed between the clusters, so things that worked fine on staging broke on production. To solve this, we made the staging and production clusters exactly the same.

Migrating production to Kubernetes

Now that staging and production clusters were ready, we began the migration. Here is the simplified structure of our monorepo:

.
├── microservice1
│   ├── Dockerfile
│   ├── Jenkinsfile
│   └── ...
├── microservice2
│   ├── Dockerfile
│   ├── Jenkinsfile
│   └── ...
├── microserviceN
│   ├── Dockerfile
│   ├── Jenkinsfile
│   └── ...
├── helm
│   ├── microservice1
│   │   ├── Chart.yaml
│   │   ├── ...
│   │   ├── values.prod.yaml
│   │   └── values.stage.yaml
│   ├── microservice2
│   │   ├── Chart.yaml
│   │   ├── ...
│   │   ├── values.prod.yaml
│   │   └── values.stage.yaml
│   └── microserviceN
│       ├── Chart.yaml
│       ├── ...
│       ├── values.prod.yaml
│       └── values.stage.yaml
└── Jenkinsfile

The main Jenkinsfile contains a map from microservice name to its directory. When a developer merges a PR to master, a tag is created in GitHub, and Jenkins deploys that tag according to the Jenkinsfile.

There is a Helm chart for every microservice in the helm directory, with separate values files for production and staging. We use Skaffold to deploy multiple Helm charts to staging. We also tried an umbrella chart, but it did not scale for us.
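A minimal skaffold.yaml for a layout like this could look as follows; the apiVersion, image names, and release names are illustrative, not our actual config:

```yaml
# Illustrative Skaffold config deploying several Helm releases to staging.
apiVersion: skaffold/v2beta29
kind: Config
build:
  artifacts:
    - image: microservice1
      context: microservice1      # directory containing the Dockerfile
    - image: microservice2
      context: microservice2
deploy:
  helm:
    releases:
      - name: microservice1
        chartPath: helm/microservice1
        valuesFiles:
          - helm/microservice1/values.stage.yaml
      - name: microservice2
        chartPath: helm/microservice2
        valuesFiles:
          - helm/microservice2/values.stage.yaml
```

A single `skaffold run` then builds all the images and upgrades all the releases, which is what made it a better fit for us than an umbrella chart.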

Every new microservice we run in production writes logs to stdout, reads secrets from Vault, and has basic alerts (replica count, 5xx errors, and latency checks on the ingress), in line with the twelve-factor app methodology.
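As a sketch, such basic alerts can be expressed as Prometheus rules like the ones below. The metric names assume kube-state-metrics and the Nginx ingress controller exporter; the thresholds and labels are made up for illustration:

```yaml
# Illustrative Prometheus alerting rules for a microservice's basic health.
groups:
  - name: microservice-basic-alerts
    rules:
      # Fires when fewer replicas are available than the deployment requests.
      - alert: DeploymentReplicasMismatch
        expr: >
          kube_deployment_spec_replicas
            != kube_deployment_status_replicas_available
        for: 10m
        labels:
          severity: warning
      # Fires on a sustained rate of 5xx responses at the ingress.
      - alert: IngressHigh5xxRate
        expr: >
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            by (ingress) > 1
        for: 5m
        labels:
          severity: critical
```

Because every service follows the same conventions, these rules can be stamped out per microservice from a template.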

Whether or not we deliver new features as microservices, some core functionality remains in Django, and it still runs on Elastic Beanstalk.

Breaking up the monolith into microservices

We used AWS CloudFront as our CDN because it made canary deploys easy during the migration. We started migrating the monolith and testing it on some language versions of our site and on the admin pages.

This smooth canary migration allowed us to find and fix bugs in production and polish our deploys over a few iterations. For several weeks, we watched the new platform, its load, and the monitoring. Eventually, we switched 100% of our traffic to Kubernetes.

After that, we stopped using Elastic Beanstalk.

UPD (Nov 2019): we now use Skaffold for production deploys as well, instead of nested Jenkinsfiles.

Summary

Our full migration took 11 months, a good result: we had expected it to take a year.

Outcomes:

- Deploy time reduced from 90 min to 40 min
- Deploy count increased from 0–2/day to 10–15/day (and still growing!)
- Rollback time decreased from 45 min to 1–2 min
- We can easily deliver new microservices to production
- Our monitoring, logging, and secret management infrastructure is now centralized and written as code

It was an awesome experience working on the migration, and we are still making improvements.

Don’t forget to read this cool article on Kubernetes written by our former colleague Yura, a YAML engineer who helped make it possible to use Kubernetes at Preply.

Also, subscribe to the Preply Engineering Blog for more interesting articles about engineering at Preply.

See ya!