
In this tutorial we are going to monitor various aspects of a Kubernetes cluster deployed with Deployment Manager. We will use the same Deployment Manager template we used in the first part of this series to deploy a GKE cluster.

Stackdriver is Google's fully managed monitoring product, designed from the ground up and running on Google's own cloud, helping you monitor and troubleshoot your applications running on GCP, AWS, or cloud-native infrastructure.

To get started with Stackdriver we first need to initialize it for our account. To do that, simply navigate to Monitoring under the Stackdriver section in the Google Cloud Console. The process can take a couple of minutes, and once it completes you will see the Stackdriver dashboard:

Before we use that dashboard we need to have something we can actually monitor.

Once you have cloned this repository in Cloud Shell, navigate to examples/v2/gke/python and set the name and zone variables first:

NAME=stackdriver-test

ZONE=us-west2-a

CLUSTER_NAME=stackdriver-test-cluster-py

You can get the full list of available zones and regions with:

gcloud compute zones list

Run the following command to provision the cluster (make sure the zone and initialNodeCount properties are set in the cluster.yaml file):

gcloud deployment-manager deployments create ${NAME} \

--config cluster.yaml
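For reference, a minimal cluster.yaml for this template might look roughly like the sketch below. This is an assumption based on the Deployment Manager GKE example from part one of this series; the actual file in the repository is authoritative, and the resource name and node count here are illustrative:

```yaml
# Hypothetical sketch of cluster.yaml; check the file shipped with the
# examples repository for the exact schema expected by cluster.py.
imports:
  - path: cluster.py

resources:
  - name: stackdriver-test-cluster
    type: cluster.py
    properties:
      zone: us-west2-a        # should match the ZONE variable above
      initialNodeCount: 3     # number of nodes in the default node pool
```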

Once the job completes you will be able to navigate to your cluster's details via Kubernetes Engine -> Clusters -> name of your cluster.

As you can see, Stackdriver Kubernetes Engine Monitoring is not enabled by default. (By the way, I haven't found any Deployment Manager configuration option for GKE that enables this directly while provisioning the cluster, nor was I able to enable the feature when editing the cluster settings in the Google Cloud Console.)

To enable Stackdriver Kubernetes Engine Monitoring execute the following command:

gcloud beta container clusters update ${CLUSTER_NAME} --enable-stackdriver-kubernetes --zone ${ZONE}
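To confirm the update took effect, you can read back the cluster's monitoringService field. The snippet below is a sketch assuming an authenticated gcloud and the CLUSTER_NAME and ZONE variables set earlier; it prints an empty value if the lookup fails:

```shell
# Read which monitoring backend the cluster reports (empty if gcloud is
# unavailable or the cluster cannot be found).
monitoring=$(gcloud container clusters describe "${CLUSTER_NAME}" \
    --zone "${ZONE}" \
    --format="value(monitoringService)" 2>/dev/null || true)
echo "monitoringService: ${monitoring}"
# monitoring.googleapis.com/kubernetes -> Stackdriver Kubernetes Engine Monitoring
# monitoring.googleapis.com            -> legacy Stackdriver monitoring
```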

Default Metrics Dashboard

Navigate to Stackdriver -> Resources -> Kubernetes Engine where you should be able to see the info about your cluster.

The basic monitoring dashboard organizes cluster information into 3 main tabs:

Infrastructure: Aggregates resources by Cluster > Node > Pod > Container.

Workloads: Aggregates resources by Cluster > Namespace > Workload > Pod > Container.

Services: Aggregates resources by Cluster > Namespace > Service > Pod > Container.

Now, let's install something there so we can check its metrics. As in the previous parts of this series, we are going to install nginx, as it's pretty straightforward and easy to use.

IMAGE=nginx

PORT=80

gcloud deployment-manager deployments create deployment \

--template deployment.py \

--properties clusterType:${NAME}-cluster-py-type,image:${IMAGE},port:${PORT}

To check that everything went well with good old kubectl, you can execute the following in the Cloud Shell console and check the status of your pods and services:

gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE}

kubectl get pods

kubectl get services
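Since it can take a moment for the pod to come up, a small wait loop helps before looking for metrics. This is an optional helper sketch (not part of the original template) that assumes kubectl credentials were fetched with get-credentials above:

```shell
# Poll the first pod's readiness for up to ~30 seconds; harmless to skip.
ready=""
for i in $(seq 1 6); do
  ready=$(kubectl get pods \
      --output=jsonpath='{.items[0].status.containerStatuses[0].ready}' \
      2>/dev/null || true)
  [ "${ready}" = "true" ] && break
  sleep 5
done
echo "pod ready: ${ready:-unknown}"
```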

After a short while, your new service and pod will be visible on the Stackdriver monitoring page:

Once we have our nginx ready, we can forward its port 80 locally so we can hit it with some traffic:

kubectl port-forward $(\

kubectl get pods --output=jsonpath="{.items[0].metadata.name}") \

9999:${PORT}

Open up another tab in Cloud Shell Console and execute:

curl localhost:9999
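If you just want a trickle of traffic before the real load test, a simple curl loop works too. This is an optional sketch that assumes the port-forward from above is still running; it counts HTTP 200 responses:

```shell
# Send 20 sequential requests and count successful (HTTP 200) responses.
ok=0
for i in $(seq 1 20); do
  code=$(curl -s -o /dev/null --max-time 2 \
      -w "%{http_code}" localhost:9999 || true)
  [ "${code}" = "200" ] && ok=$((ok+1))
done
echo "successful requests: ${ok}/20"
```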

For simple load testing we are going to use the Apache Bench tool ab, which needs to be installed if you haven't already used it from your Cloud Shell:

sudo apt install apache2-utils

Now we can stress our nginx server a bit and see the changes on the Stackdriver dashboard:

ab -n 50000 -c 1000 http://localhost:9999/

Drill down to our container in the SERVICES tab and click on it. Basic monitoring metrics for memory and CPU should be visible. Notice that during our ab test run, the CPU request_utilization metric went up significantly.

Metric-based Alerting

Of course, if you want to be informed about unusual behavior of some value you monitor in your cluster, you can configure alerts that notify you when something unusual occurs. In Stackdriver these are called Alerting Policies.

Navigate to Alerting -> Create Policies and create your first alerting policy for CPU request utilization, monitored at the container level.

We have to specify 4 basic attributes for our new alerting policy:

condition — under which the policy gets violated

notification — who will get notified, and how, in case the policy gets violated

documentation (optional) — information you want to pass along to the person who will handle the unhealthy situation in the cluster

name

For the condition, start typing request utiliza... in the Target field and the form will suggest the right metric for our container. Filter the data to the "container" name for the resource type container and you should see just one metric displayed on the graph on the right.

Once we have the metric we are interested in, we can configure a threshold for our condition; exceeding it violates the policy. To keep things simple here, you can define the threshold as is above 5–10% for 1 minute, which is self-explanatory.
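If you prefer to keep alerting policies in version control, roughly the same condition can be expressed as a policy file and applied with gcloud alpha monitoring policies create --policy-from-file=policy.yaml. This is a hedged sketch: the display names and the 10% threshold are illustrative, and the alpha command's schema may change:

```yaml
# Hypothetical policy.yaml mirroring the condition configured in the UI.
displayName: "CPU request utilization high"
combiner: OR
conditions:
  - displayName: "container CPU request_utilization above 10% for 1 minute"
    conditionThreshold:
      filter: >
        metric.type="kubernetes.io/container/cpu/request_utilization"
        AND resource.type="k8s_container"
      comparison: COMPARISON_GT
      thresholdValue: 0.1
      duration: 60s
```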

After creating the policy you will see a dashboard with a view of the number of incidents.

Now, run our Apache Bench command again to create some more load on nginx:

ab -n 50000 -c 1000 http://localhost:9999/

You need to be careful with the concurrency setting of the ab command with our simple nginx pod, as it can become unresponsive at times. In such a case, decrease the number of concurrent requests (the -c flag) as well as the threshold of your alert condition.

After a short while you should see a new incident registered in Stackdriver; you can open it up and browse its details:

If you have specified a notification email, you will be notified through that channel too. The email contains basic information about the incident, along with links where you can find out more and take action.

Don't forget to clean up everything once you stop playing around with these resources on GCP. For this tutorial we created 2 deployments, which you can simply delete with gcloud commands:

gcloud deployment-manager deployments delete deployment

gcloud deployment-manager deployments delete stackdriver-test

There is a ton of functionality on Stackdriver which I would like to learn and share with you, so watch this space for future blog posts.