Over the last two years, I've worked with a number of teams to deploy their applications on Kubernetes. Getting developers up to speed with Kubernetes jargon can be challenging, so when a Deployment fails, I'm usually paged to figure out what went wrong.

One of my primary goals when working with a client is to automate & educate myself out of that job, so I try to give developers the tools necessary to debug failed deployments. I've catalogued the most common reasons Kubernetes Deployments fail, and I'm sharing my troubleshooting playbook with you!

Without further ado, here are the 10 most common reasons Kubernetes Deployments fail:

1. Wrong Container Image / Invalid Registry Permissions

Two of the most common problems are (a) having the wrong container image specified and (b) trying to use private images without providing registry credentials. These are especially tricky when starting to work with Kubernetes or wiring up CI/CD for the first time.

Let's see an example. First, we'll create a deployment named fail pointing to a non-existent Docker image:

$ kubectl run fail --image=rosskukulinski/dne:v1.0.0

We can then inspect our Pods and see that we have one Pod with a status of ErrImagePull or ImagePullBackOff.

$ kubectl get pods
NAME                    READY     STATUS             RESTARTS   AGE
fail-1036623984-hxoas   0/1       ImagePullBackOff   0          2m

For some additional information, we can describe the failing Pod:

$ kubectl describe pod fail-1036623984-hxoas

If we look in the Events section of the output of the describe command we will see something like:

Events:
  FirstSeen LastSeen Count From                                            SubObjectPath          Type     Reason      Message
  --------- -------- ----- ----                                            -------------          ----     ------      -------
  5m        5m       1     {default-scheduler }                                                   Normal   Scheduled   Successfully assigned fail-1036623984-hxoas to gke-nrhk-1-default-pool-a101b974-wfp7
  5m        2m       5     {kubelet gke-nrhk-1-default-pool-a101b974-wfp7} spec.containers{fail}  Normal   Pulling     pulling image "rosskukulinski/dne:v1.0.0"
  5m        2m       5     {kubelet gke-nrhk-1-default-pool-a101b974-wfp7} spec.containers{fail}  Warning  Failed      Failed to pull image "rosskukulinski/dne:v1.0.0": Error: image rosskukulinski/dne not found
  5m        2m       5     {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}                        Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "fail" with ErrImagePull: "Error: image rosskukulinski/dne not found"
  5m        11s      19    {kubelet gke-nrhk-1-default-pool-a101b974-wfp7} spec.containers{fail}  Normal   BackOff     Back-off pulling image "rosskukulinski/dne:v1.0.0"
  5m        11s      19    {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}                        Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "fail" with ImagePullBackOff: "Back-off pulling image \"rosskukulinski/dne:v1.0.0\""

The error string, Failed to pull image "rosskukulinski/dne:v1.0.0": Error: image rosskukulinski/dne not found, tells us that Kubernetes was not able to find the image rosskukulinski/dne:v1.0.0.

So then the question is: Why couldn't Kubernetes pull the image?

There are three primary culprits besides network connectivity issues:

The image tag is incorrect

The image doesn't exist (or is in a different registry)

Kubernetes doesn't have permissions to pull that image

If you don't spot an obvious typo in your image tag, then it's time to test from your local machine.

I usually start by running docker pull on my local development machine with the exact same image tag. In this case, I would run docker pull rosskukulinski/dne:v1.0.0.

If this succeeds, then it probably means that Kubernetes doesn't have the correct permissions to pull that image. Go read up on Image Pull Secrets to fix this issue.
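The fix for private registries is to store the registry credentials in a docker-registry Secret and reference it from the Pod spec. A minimal sketch, assuming you've already created a Secret (the name regcred and the image path below are placeholders) with something like kubectl create secret docker-registry regcred --docker-server=... --docker-username=... --docker-password=... :

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
    - name: app
      # placeholder: replace with your private registry image
      image: quay.io/rosskukulinski/dne:v1.0.0
  # tells the kubelet which registry credentials to use when pulling
  imagePullSecrets:
    - name: regcred
```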

If the exact image tag fails, then I will test without an explicit image tag - docker pull rosskukulinski/dne - which will attempt to pull the latest tag. If this succeeds, then the original tag specified doesn't exist. This could be due to human error, a typo, or a misconfiguration of the CI/CD system.

If docker pull rosskukulinski/dne (without an exact tag) fails, then we have a bigger problem - that image does not exist at all in our image registry. By default, Kubernetes uses the Docker Hub registry. If you're using Quay.io, AWS ECR, or Google Container Registry, you'll need to specify the registry URL in the image string. For example, on Quay, the image would be quay.io/rosskukulinski/dne:v1.0.0.

If you are using Docker Hub, then you should double-check the system that is publishing images to the registry. Make sure the name & tag match what your Deployment is trying to use.

Note: There is no observable difference in Pod status between a missing image and incorrect registry permissions. In either case, Kubernetes will report an ErrImagePull status for the Pods.

2. Application Crashing after Launch

Whether you're launching a new application on Kubernetes or migrating an existing platform, having the application crash on startup is a common occurrence.

Let's create a new Deployment with an application that crashes after 1 second:

$ kubectl run crasher --image=rosskukulinski/crashing-app

Then let's take a look at the status of our Pods:

$ kubectl get pods
NAME                       READY     STATUS             RESTARTS   AGE
crasher-2443551393-vuehs   0/1       CrashLoopBackOff   2          54s

Ok, so CrashLoopBackOff tells us that Kubernetes is trying to launch this Pod, but one or more of the containers is crashing or getting killed.

Let's describe the pod to get some more information:

$ kubectl describe pod crasher-2443551393-vuehs
Name:           crasher-2443551393-vuehs
Namespace:      fail
Node:           gke-nrhk-1-default-pool-a101b974-wfp7/10.142.0.2
Start Time:     Fri, 10 Feb 2017 14:20:29 -0500
Labels:         pod-template-hash=2443551393
                run=crasher
Status:         Running
IP:             10.0.0.74
Controllers:    ReplicaSet/crasher-2443551393
Containers:
  crasher:
    Container ID:       docker://51c940ab32016e6d6b5ed28075357661fef3282cb3569117b0f815a199d01c60
    Image:              rosskukulinski/crashing-app
    Image ID:           docker://sha256:cf7452191b34d7797a07403d47a1ccf5254741d4bb356577b8a5de40864653a5
    Port:
    State:              Terminated
      Reason:           Error
      Exit Code:        1
      Started:          Fri, 10 Feb 2017 14:22:24 -0500
      Finished:         Fri, 10 Feb 2017 14:22:26 -0500
    Last State:         Terminated
      Reason:           Error
      Exit Code:        1
      Started:          Fri, 10 Feb 2017 14:21:39 -0500
      Finished:         Fri, 10 Feb 2017 14:21:40 -0500
    Ready:              False
    Restart Count:      4
...

Awesome! Kubernetes is telling us that this Pod is being Terminated due to the application inside the container crashing. Specifically, we can see that the application Exit Code is 1. We might also see an OOMKilled error, but we'll get to that later.

So our application is crashing ... why?

The first thing we can do is check our application logs. Assuming you are sending your application logs to stdout (which you should be!), you can see the application logs using kubectl logs.

$ kubectl logs crasher-2443551393-vuehs

Unfortunately, this Pod doesn't seem to have any log data. It's possible we're looking at a newly-restarted instance of the application, so we should check the previous container:

$ kubectl logs crasher-2443551393-vuehs --previous

Rats! Our application still isn't giving us anything to work with. It's probably time to add some additional log messages on startup to help debug the issue. We might also want to try running the container locally to see if there are missing environment variables or mounted volumes.

3. Missing ConfigMap or Secret

Kubernetes best practices recommend passing application run-time configuration via ConfigMaps or Secrets. This data could include database credentials, API endpoints, or other configuration flags.

A common mistake I've seen developers make is creating Deployments that reference ConfigMaps or Secrets that don't exist, or that reference keys that don't exist within them.

Let's see what that might look like.

Missing ConfigMap

For our first example, we're going to try to create a Pod that loads ConfigMap data as environment variables.

# configmap-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how

Let's create the Pod with kubectl create -f configmap-pod.yaml. After waiting a few moments, we can peek at our Pods:

$ kubectl get pods
NAME            READY     STATUS              RESTARTS   AGE
configmap-pod   0/1       RunContainerError   0          3s

Our Pod's status says RunContainerError. We can use kubectl describe to learn more:

$ kubectl describe pod configmap-pod
[...]
Events:
  FirstSeen LastSeen Count From                                      SubObjectPath                    Type     Reason      Message
  --------- -------- ----- ----                                      -------------                    ----     ------      -------
  20s       20s      1     {default-scheduler }                                                       Normal   Scheduled   Successfully assigned configmap-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  19s       2s       3     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Pulling     pulling image "gcr.io/google_containers/busybox"
  18s       2s       3     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Pulled      Successfully pulled image "gcr.io/google_containers/busybox"
  18s       2s       3     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}                                  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "test-container" with RunContainerError: "GenerateRunContainerOptions: configmaps \"special-config\" not found"

The last item in the Events section explains what went wrong. The Pod is attempting to access a ConfigMap named special-config, but it's not found in this namespace. Once we create the ConfigMap, the Pod should restart and pull in the runtime data.
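The fix is to create the missing ConfigMap with the key the Pod expects. A minimal sketch; the value very is a placeholder, use whatever your application actually needs:

```yaml
# special-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  # the key the Pod's configMapKeyRef is looking for
  special.how: very
```

Apply it with kubectl create -f special-config.yaml and the Pod should start correctly on its next restart.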

Accessing Secrets as environment variables within your Pod specification will result in similar errors to what we've seen here with ConfigMaps.

But what if you're accessing a Secret or a ConfigMap via a volume?

Missing Secret

Here's a Pod spec that references a Secret named myothersecret and attempts to mount it as a volume.

# missing-secret.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      volumeMounts:
        - mountPath: /etc/secret/
          name: myothersecret
  restartPolicy: Never
  volumes:
    - name: myothersecret
      secret:
        secretName: myothersecret

Let's create this Pod with kubectl create -f missing-secret.yaml.

After a few minutes, when we get our Pods, we'll see that the Pod is still stuck in the ContainerCreating state.

$ kubectl get pods
NAME         READY     STATUS              RESTARTS   AGE
secret-pod   0/1       ContainerCreating   0          4h

That's odd ... let's describe the Pod to see what's going on.

$ kubectl describe pod secret-pod
Name:           secret-pod
Namespace:      fail
Node:           gke-ctm-1-sysdig2-35e99c16-tgfm/10.128.0.2
Start Time:     Sat, 11 Feb 2017 14:07:13 -0500
Labels:
Status:         Pending
IP:
Controllers:
[...]
Events:
  FirstSeen LastSeen Count From                                      SubObjectPath  Type     Reason       Message
  --------- -------- ----- ----                                      -------------  ----     ------       -------
  18s       18s      1     {default-scheduler }                                     Normal   Scheduled    Successfully assigned secret-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  18s       2s       6     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}                Warning  FailedMount  MountVolume.SetUp failed for volume "kubernetes.io/secret/337281e7-f065-11e6-bd01-42010af0012c-myothersecret" (spec.Name: "myothersecret") pod "337281e7-f065-11e6-bd01-42010af0012c" (UID: "337281e7-f065-11e6-bd01-42010af0012c") with: secrets "myothersecret" not found

Once again, the Events section explains the problem. It's telling us that the kubelet failed to mount a volume from the Secret myothersecret. To fix this problem, create myothersecret containing the necessary secure credentials. Once myothersecret has been created, the container will start correctly.
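A sketch of what creating that Secret might look like; the key name and value here are placeholders for whatever credentials your application expects:

```yaml
# myothersecret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
type: Opaque
stringData:
  # stringData accepts plain-text values; Kubernetes base64-encodes them for you
  password: not-a-real-password
```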

4. Liveness/Readiness Probe Failure

An important lesson for developers to learn when working with containers and Kubernetes is that just because your application container is running doesn't mean that it's working.

Kubernetes provides two essential features called Liveness Probes and Readiness Probes. Essentially, Liveness/Readiness Probes will periodically perform an action (e.g. make an HTTP request, open a TCP connection, or run a command in your container) to confirm that your application is working as intended.

If the Liveness Probe fails, Kubernetes will kill your container and create a new one. If the Readiness Probe fails, that Pod will not be available as a Service endpoint, meaning no traffic will be sent to that Pod until it becomes Ready.

If you attempt to deploy a change to your application that fails the Liveness/Readiness Probe, the rolling deploy will hang as it waits for all of your Pods to become Ready.

So what does this look like? Here's a Pod spec that defines a Liveness & Readiness Probe that checks for a healthy HTTP response for /healthz on port 8080.

apiVersion: v1
kind: Pod
metadata:
  name: liveness-pod
spec:
  containers:
    - name: test-container
      image: rosskukulinski/leaking-app
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 3
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 3

Let's create this Pod with kubectl create -f liveness.yaml, and then see what happens after a few minutes:

$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
liveness-pod   0/1       Running   4          2m

After 2 minutes, we can see that our Pod is still not "Ready", and it has been restarted four times. Let's describe the Pod for more information.

$ kubectl describe pod liveness-pod
Name:           liveness-pod
Namespace:      fail
Node:           gke-ctm-1-sysdig2-35e99c16-tgfm/10.128.0.2
Start Time:     Sat, 11 Feb 2017 14:32:36 -0500
Labels:
Status:         Running
IP:             10.108.88.40
Controllers:
Containers:
  test-container:
    Container ID:       docker://8fa6f99e6fda6e56221683249bae322ed864d686965dc44acffda6f7cf186c7b
    Image:              rosskukulinski/leaking-app
    Image ID:           docker://sha256:7bba8c34dad4ea155420f856cd8de37ba9026048bd81f3a25d222fd1d53da8b7
    Port:
    State:              Running
      Started:          Sat, 11 Feb 2017 14:40:34 -0500
    Last State:         Terminated
      Reason:           Error
      Exit Code:        137
      Started:          Sat, 11 Feb 2017 14:37:10 -0500
      Finished:         Sat, 11 Feb 2017 14:37:45 -0500
[...]
Events:
  FirstSeen LastSeen Count From                                      SubObjectPath                    Type     Reason     Message
  --------- -------- ----- ----                                      -------------                    ----     ------     -------
  8m        8m       1     {default-scheduler }                                                       Normal   Scheduled  Successfully assigned liveness-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  8m        8m       1     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Created    Created container with docker id 0fb5f1a56ea0; Security:[seccomp=unconfined]
  8m        8m       1     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Started    Started container with docker id 0fb5f1a56ea0
  7m        7m       1     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Created    Created container with docker id 3f2392e9ead9; Security:[seccomp=unconfined]
  7m        7m       1     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Normal   Killing    Killing container with docker id 0fb5f1a56ea0: pod "liveness-pod_fail(d75469d8-f090-11e6-bd01-42010af0012c)" container "test-container" is unhealthy, it will be killed and re-created.
  8m        16s      10    {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Warning  Unhealthy  Liveness probe failed: Get http://10.108.88.40:8080/healthz: dial tcp 10.108.88.40:8080: getsockopt: connection refused
  8m        1s       85    {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm} spec.containers{test-container}  Warning  Unhealthy  Readiness probe failed: Get http://10.108.88.40:8080/healthz: dial tcp 10.108.88.40:8080: getsockopt: connection refused

Once again, the Events section comes to the rescue. We can see that the Readiness and Liveness probes are both failing. The key string to look for is container "test-container" is unhealthy, it will be killed and re-created. This tells us that Kubernetes is killing the container because the Liveness Probe has failed.

There are three likely possibilities:

Your Probes are now incorrect - Did the health URL change?

Your Probes are too sensitive - Does your application take a while to start or respond?

Your application is no longer responding correctly to the Probe - Is your database misconfigured?

Looking at the logs from your Pod is a good place to start debugging. Once you resolve this issue, a fresh Deployment should succeed.
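If the probes are simply too aggressive for a slow-starting application, loosening the timing parameters is often enough. A sketch with illustrative values; tune them to your application's actual startup and response times:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before the first check
  periodSeconds: 10         # probe less frequently
  timeoutSeconds: 5         # allow slower responses before counting a failure
  failureThreshold: 3       # require several consecutive failures before killing
```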

5. Exceeding CPU/Memory Limits

Kubernetes gives cluster administrators the ability to limit the amount of CPU or memory allocated to Pods and Containers. As an application developer, you might not know about the limits and then be surprised when your Deployment fails.

Let's attempt to create this Deployment in a cluster whose CPU/Memory request limits we don't know:

# gateway.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 5Gi

You'll notice that we're setting a resource request of 5Gi of memory. Let's create the deployment: kubectl create -f gateway.yaml.

Now we can look at our Pods:

$ kubectl get pods
No resources found.

Huh? Let's inspect our Deployment using describe:

$ kubectl describe deployment/gateway
Name:                   gateway
Namespace:              fail
CreationTimestamp:      Sat, 11 Feb 2017 15:03:34 -0500
Labels:                 app=gateway
Selector:               app=gateway
Replicas:               0 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet:          gateway-764140025 (0/1 replicas created)
Events:
  FirstSeen LastSeen Count From                     SubObjectPath  Type    Reason             Message
  --------- -------- ----- ----                     -------------  ----    ------             -------
  4m        4m       1     {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica set gateway-764140025 to 1

Based on that last line, our deployment created a ReplicaSet ( gateway-764140025 ) and scaled it up to 1. The ReplicaSet is the entity that manages the lifecycle of the Pods. We can describe the ReplicaSet:

$ kubectl describe rs/gateway-764140025
Name:           gateway-764140025
Namespace:      fail
Image(s):       nginx
Selector:       app=gateway,pod-template-hash=764140025
Labels:         app=gateway
                pod-template-hash=764140025
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen LastSeen Count From                     SubObjectPath  Type     Reason        Message
  --------- -------- ----- ----                     -------------  ----     ------        -------
  6m        28s      15    {replicaset-controller }                Warning  FailedCreate  Error creating: pods "gateway-764140025-" is forbidden: [maximum memory usage per Pod is 100Mi, but request is 5368709120., maximum memory usage per Container is 100Mi, but request is 5Gi.]

Ahh! There we go. The cluster administrator has set a maximum memory usage per Pod of 100Mi (what a cheapskate!). You can inspect the current namespace limits by running kubectl describe limitrange.

You now have three choices:

Ask your cluster admin to increase the limits

Reduce the Request or Limit settings for your Deployment

Go rogue and edit the limits (kubectl edit FTW!)
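If you go with the second option, the change is just bringing your requests (and limits) under the namespace ceiling. A sketch of the resources block; 64Mi is an arbitrary value under the 100Mi cap, so size it to what your container actually needs:

```yaml
resources:
  requests:
    memory: 64Mi    # below the 100Mi per-Container maximum
  limits:
    memory: 100Mi
```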

Check out Part 2!

And those are the first five of the 10 most common reasons Kubernetes Deployments fail. Click here for Part 2, which covers #6-10.