Health Checks like a Pro

Introducing go-sundheit

We’re very excited to announce go-sundheit, our shiny new health checks library for Go, designed for high-scale services and large-scale deployments.

The go-sundheit project is named after the German word Gesundheit, which means ‘health’ and is pronounced /ɡəˈzʊntˌhaɪ̯t/.

We started this project because at AppsFlyer, as at other fast-growing companies, we run a large operation that is managed by practicing continuous delivery. It is vital that our deployments and runtime are safe. This means that we have to know as soon as possible when a deployment has gone bad or when a resource that a service depends on is in poor shape. We need this so we can sleep well at night.

To achieve this level of safety, deployment orchestration systems such as Kubernetes, and discovery systems such as Consul, require you to implement endpoints that report the readiness and liveness of your service. These endpoints are called during a deployment to verify that it succeeded, and periodically afterwards to ensure the liveness and health of the service.

The main challenge is that you want these endpoints to be implemented correctly and to work well at scale. Both scale and correctness are sometimes overlooked. What I’ve seen many developers do is implement an endpoint that simply returns 200 OK, no matter what state the service is in.

The problem with this implementation is that it only represents the availability of the service, or its responsiveness. What it doesn’t tell you is whether the service is able to serve API requests. This endpoint actually resembles a /ping rather than a /health endpoint. For example, imagine what would happen if Service-A in the diagram below relies on the DB to serve requests, and the DB goes down. With the ping strategy, the service claims to be healthy, but it’s actually unable to serve requests. This is why the health API must reflect our ability to serve requests.

The broken dependency problem

The next step in the evolution of your infrastructure could be a health endpoint that, upon request, runs a series of checks and returns 200 OK if they all pass, or an error status otherwise. While this approach works well in many cases and is not that hard to implement, it has a significant flaw. Because the checks run on each request to the health endpoint, you can easily bring the service to its knees if you call the endpoint too often, and you may also transitively create unnecessary pressure on the downstream dependencies. This scaling issue is often overlooked. At this point you can introduce all sorts of caching mechanisms, but in most cases you will still have requests that take longer than they should due to the synchronous nature of the checks’ execution.

Is there a way out of this?

This is where go-sundheit comes into play.