Namely is growing rapidly: we have dozens of engineers pushing out new features across a variety of services every week. We recently rolled out Spinnaker to help our engineers reliably deliver code with zero impact to our customers. We were previously using a set of simple bash scripts powered by Jenkins and needed a better solution for visibility, rollbacks, and configuration for services running on Kubernetes. Spinnaker creates a powerful abstraction for engineers to easily understand which code is running in which environment and seamlessly move code between environments.

Migrating to Spinnaker was no small endeavor. We needed to ensure that running Spinnaker was bulletproof: if our CD system fails, we cannot deliver new code and we cannot mitigate issues with code fixes. We also needed to ensure our engineers were comfortable using Spinnaker for code promotions, production deployments, rollbacks, and scaling. In this post we cover how we introduced Spinnaker into our ecosystem, how we learned to operate it reliably, and which abstractions we’ve created around Spinnaker to ease the developer experience.

Running Spinnaker in a highly available manner required us to understand its architecture at a deeper level. This is essential for an engineering organization where dozens of engineers will depend on Spinnaker for their work. Spinnaker is a system designed around microservices where various components work together independently. It also connects with multiple other services (Jenkins, Kubernetes). This creates numerous opportunities for failure and micro-outages. When pipelines fail the SRE team must quickly identify whether the problem lies within Spinnaker, within Kubernetes or within the deployed service itself.
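As a rough illustration of that first triage step, the sketch below asks each Spinnaker microservice for its Spring Boot `/health` status and reports whether the fault appears to sit inside Spinnaker. It is a minimal Python sketch, not tooling described in this post: the `spin-*` hostnames and the `fetch_status`, `check_all`, and `triage` helpers are assumptions, and the ports are the Spinnaker defaults, which an installation may override.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict

# Hypothetical in-cluster hostnames; ports are the Spinnaker service defaults.
SERVICES: Dict[str, str] = {
    "gate": "http://spin-gate:8084/health",
    "orca": "http://spin-orca:8083/health",
    "clouddriver": "http://spin-clouddriver:7002/health",
    "igor": "http://spin-igor:8088/health",
    "front50": "http://spin-front50:8080/health",
    "echo": "http://spin-echo:8089/health",
}


def fetch_status(url: str, timeout: float = 3.0) -> str:
    """Return the reported health status, or DOWN/UNREACHABLE on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError:
        # The service answered, but with an error code (e.g. 503 when unhealthy).
        return "DOWN"
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a non-JSON body.
        return "UNREACHABLE"
    if isinstance(body, dict):
        return str(body.get("status", "UNKNOWN"))
    return "UNKNOWN"


def check_all(fetch: Callable[[str], str] = fetch_status) -> Dict[str, str]:
    """Query every service; `fetch` is injectable so the logic can be tested offline."""
    return {name: fetch(url) for name, url in SERVICES.items()}


def triage(statuses: Dict[str, str]) -> str:
    """Point at Spinnaker if any component is not UP, otherwise look elsewhere."""
    unhealthy = sorted(name for name, status in statuses.items() if status != "UP")
    if unhealthy:
        return "spinnaker: " + ", ".join(unhealthy)
    return "spinnaker healthy: check Kubernetes or the deployed service"


# Usage (performs real HTTP calls): print(triage(check_all()))
```

If every component reports `UP`, the failure is more likely in the target Kubernetes cluster or in the service being deployed, which narrows where to look next.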

Understanding the Spinnaker architecture lets us quickly diagnose events such as clouddriver (the cloud-interface abstraction) losing connectivity to a Kubernetes cluster, or the failure of the Redis pod that Spinnaker uses as its state store. Developing this understanding, designing a deployment strategy, and deciding how engineers will use Spinnaker requires a multi-pronged approach.

Read The Code

Reading the code for Spinnaker’s microservices is paramount to gaining operational understanding. This is a powerful yet often overlooked tool that SREs should employ when working with systems, whether developed internally or externally. For instance, understanding how to override custom profiles was instrumental in tuning Spinnaker to achieve the expected behavior during our PoC phase and resilience testing. Spinnaker is written in Java, while Namely centers on Ruby, Go and .NET Core, so we had to deep-dive into Spring profiles to understand Spinnaker configuration.

With this understanding we were able to speed up the rate at which clouddriver polls agents for an overall snappier experience at our size. Here’s how we added a custom override to the clouddriver-local.yml configuration:

redis:
  connection: ${services.redis.baseUrl:redis://localhost:6379}
  poll:
    intervalSeconds: 15

By default, disabling the previous deployment in a Spinnaker pipeline can take up to 90 seconds in a `disable cluster` stage. Increasing the frequency at which orca polls clouddriver can dramatically speed up your pipeline execution. Here’s how you can lower that wait to 25 seconds:

tasks:
  disableClusterMinTimeMillis: 25000

HA or Get Out

Most open source tools offer a quickstart approach to get you up and running easily. Although it gets you using the tool quickly, it is never appropriate for production use. Spinnaker is no different. One challenge in setting up a highly available Spinnaker system is that it requires persistent backing storage. Out of the box, Spinnaker uses a single Redis instance to hold state. On a platform such as Kubernetes this can prove problematic when Spinnaker’s Redis pod fails and gets redeployed. We came to this realization when older builds were randomly triggering from Jenkins and several builds were running concurrently on multiple environments. After discovering what happens when you lose Redis connectivity and the circuit breaker kicks in, we switched to a persistent Redis cluster via ElastiCache.

Additionally you need to ensure your polling mechanism can recreate state appropriately. This is handled by the Igor service. In the event of a catastrophic failure we want to ensure old builds are not sporadically triggering in Spinnaker. In igor-local.yml:



spinnaker:
  build:
    pollInterval: 30
  pollingSafeguard:
    # A lower value can help prevent accidental re-triggers of pipelines, but may require hands-on operations if
    # set low.
    # See code documentation for more details. Class [PollingSafeguardProperties] in
    # https://github.com/spinnaker/igor/blob/master/igor-web/src/main/groovy/com/netflix/spinnaker/igor/IgorConfigurationProperties.groovy#L47
    itemUpperThreshold: 1000
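To see why an upper threshold protects against re-triggering, consider a simplified model of a single polling cycle. The Python sketch below illustrates the idea only and is not igor’s implementation: `poll_once`, `PollResult`, and the in-memory `seen` set are hypothetical stand-ins for the polling monitor and its Redis-backed cache.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class PollResult:
    triggered: List[str]  # build ids that would fire pipelines this cycle
    skipped: bool         # True when the safeguard blocked the cycle


def poll_once(seen: Set[str], current: List[str], item_upper_threshold: int) -> PollResult:
    """Model one polling cycle with an upper bound on newly discovered items."""
    new_items = [build for build in current if build not in seen]
    if len(new_items) > item_upper_threshold:
        # Too many "new" builds at once looks like lost state rather than real
        # activity, so nothing is triggered and nothing is recorded as seen.
        return PollResult(triggered=[], skipped=True)
    seen.update(new_items)
    return PollResult(triggered=new_items, skipped=False)
```

With a healthy cache, only the genuinely new build is triggered. If the cache is wiped, every historical build looks new, the count exceeds the threshold, and the cycle is blocked instead of replaying old builds. In this model a blocked cycle stays blocked until someone intervenes, which mirrors the warning in the configuration comment that a low value may require hands-on operations.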

Any HA solution requires multiple replicas of its mission-critical components, and Spinnaker is no exception. The critical Spinnaker components include clouddriver, which caches your cloud agent data and determines the state of your deployments, and orca, which handles orchestration of tasks across Spinnaker. It’s essential to adjust the number of replicas to match your workload across clusters. If too few resources are allocated, tasks will queue, which delays critical operations like rollbacks.
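A back-of-the-envelope model shows how queueing delay shrinks as replicas are added. This Python sketch assumes equal-length tasks and a fixed number of concurrent workers per replica; `estimated_wait_seconds` and all of the numbers used with it are illustrative assumptions, not Spinnaker defaults or measurements from our clusters.

```python
def estimated_wait_seconds(
    queued_ahead: int,
    replicas: int,
    workers_per_replica: int,
    task_seconds: float,
) -> float:
    """Estimate how long a new task waits before a worker picks it up.

    Assumes every task takes `task_seconds` and all workers start idle, so the
    queue drains in waves of `replicas * workers_per_replica` tasks.
    """
    if replicas < 1 or workers_per_replica < 1:
        raise ValueError("replicas and workers_per_replica must be at least 1")
    if queued_ahead < 0:
        raise ValueError("queued_ahead must not be negative")
    workers = replicas * workers_per_replica
    full_waves = queued_ahead // workers  # waves that must finish before ours starts
    return full_waves * task_seconds
```

Under these assumptions, a rollback queued behind 40 tasks of 30 seconds each waits about 120 seconds with one replica of 10 workers, and about 30 seconds with four such replicas.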

Learn From Failure (and Bugs)

To offer our engineers a better deployment experience via Spinnaker, we ensured that any unintended behavior in Spinnaker was investigated and that remediation steps were offered (whether a process change, tooling enhancement, or bug fix). This was time consuming, but investigating flakiness or inconsistency is essential: we must ensure unintended behaviors do not bubble up to our production environment. Like any other software, Spinnaker has bugs. Some are caused by the ever-changing Kubernetes platform, others by Spinnaker itself. Finding bugs and possible workarounds built up our confidence in using Spinnaker. On a related note, Namely has a healthy post-mortem culture, so we continuously learn from failure and build resiliency into our platform.

Migrate Incrementally, Build Automation

In our move to Spinnaker we agreed on a migration strategy that would let us first move over specific projects that were low risk and less likely to cause production outages for our customers. After moving a handful of projects we quickly realized that more tooling was required to empower our developers to stand up new projects on their own. By creating automation such as our open source k8s-pipeliner and estuary, we were able to ease engineers into adopting our new infrastructure with minimal impact on their day-to-day work.

Leverage the Community

Open source software is more than sharing code. A healthy community leverages economies of scale: you may be a small organization, but many small distributed organizations create a large, healthy community of users. Staying active in the Spinnaker community is an excellent way to stay up to date on how other companies are using Spinnaker. Spinnaker is a complex system with many moving parts. Understanding how other companies use the product was crucial in determining where we should focus our efforts. In addition, watching the GitHub repository lets us see which areas of the project are receiving enhancements or bug fixes.

Automate All the Things

Our approach began by building Spinnaker naively, following its setup guide and using Halyard. This worked well initially because it allowed us to understand its internals, but it grew out of control as we began to add multiple clusters, authentication, and alerting. To make our process reproducible and to be able to blow away an existing installation at any moment, we moved to a Terraform setup using remote-exec modules, templates, and HashiCorp Vault for certificate management. This allows us to treat our Spinnaker infrastructure like any other infrastructure. We version control deployment changes to our Spinnaker topology, configuration changes, and secrets, which lets us reproduce our environment and test new updates easily.

The author would like to thank Michael Hamrah for his contributions to the article.