Monitoring SSL Certificate Expiry in GCP and Kubernetes



Problem

At my current job, we use Google Cloud Platform. Each team has a set of GCP Projects; each project can have multiple clusters. The majority of services that our teams write expose some kind of HTTP API or web interface. All HTTP endpoints we expose are encrypted with SSL[1], so we have a lot of SSL certificates in a lot of different places.

Each of our GCP projects is built using our CI/CD tooling. All GCP resources and all of our Kubernetes application manifests are defined in git. We have a standard set of stacks that we deploy to each cluster using our templating. One of the stacks is Prometheus, InfluxDB, and Grafana. In this article, I’ll explain how we leverage (part of) this stack to automatically monitor SSL certificates in use by our load balancers across all of our GCP projects.

Certificate Renewal

To enable teams to expose services with minimal effort, we rely on deploying a Kubernetes LetsEncrypt controller to each of our clusters. The LetsEncrypt controller automatically provisions certificates for Kubernetes resources that require them, as indicated by annotations on the resources, e.g.:

    apiVersion: v1
    kind: Service
    metadata:
      name: app0
      labels:
        app: app0
      annotations:
        acme/certificate: app0.prod.gcp0.example.com
        acme/secretName: app0-certificate
    spec:
      type: ClusterIP
      ports:
        - port: 3000
          targetPort: 3000
      selector:
        app: app0

This certificate can now be consumed by an NGiNX ingress controller, like so:

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: app0
      annotations:
        kubernetes.io/ingress.class: "nginx"
    spec:
      tls:
        - secretName: app0-certificate
          hosts:
            - app0.prod.gcp0.example.com
      rules:
        - host: app0.prod.gcp0.example.com
          http:
            paths:
              - path: /
                backend:
                  serviceName: app0
                  servicePort: 3000

Switching the ingress.class annotation to have the value of gce will mean Google Compute Engine will handle this configuration. A copy of the secret (the SSL certificate) will be made in GCP as a Compute SSL Certificate resource, which the GCP load balancer can then use to serve HTTPS.

Of course, this isn’t the only method for deploying SSL certificates for services in GCP and/or Kubernetes. In our case, we also have many legacy certificates that are manually renewed by humans, stored encrypted in our repositories, and deployed as secrets to Kubernetes or SSL Certificate resources to Google Compute Engine.

The GCE ingress controller makes a copy of the secret as a Compute SSL Certificate. This means that certificates used in the default Kubernetes load balancers are stored in two separate locations: the Kubernetes cluster, as a secret, and in GCE, as a Certificate resource.

Regardless of how the certificates end up in either GCE or Kubernetes, we can monitor them with Prometheus.

Whether manually renewed or managed by LetsEncrypt, our certificates end up in up to two places:

The Kubernetes Secret store

As a GCP compute SSL Certificate

Note that the NGiNX ingress controller works by mounting the Kubernetes Secret into the controller as a file.

The following commands will show certificates for each respective location:

Kubernetes Secrets ( kubectl get secret )

GCP compute ssl-certificates ( gcloud compute ssl-certificates list )
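To look inside one of those Kubernetes Secrets, the certificate has to be base64-decoded first. The following is a minimal sketch, assuming the secret is the kind an Ingress `tls` section consumes (certificate chain stored under the `tls.crt` key) and that it has been dumped with `kubectl get secret NAME -o json`. The function name `pem_from_tls_secret` is made up for this illustration.

```python
import base64
import json


def pem_from_tls_secret(secret_json: str) -> str:
    """Return the PEM certificate chain stored in a Kubernetes TLS secret.

    secret_json is the text printed by `kubectl get secret NAME -o json`.
    """
    secret = json.loads(secret_json)
    # Secret values are base64-encoded; Ingress TLS secrets keep the
    # certificate chain under the "tls.crt" key.
    encoded = secret["data"]["tls.crt"]
    return base64.b64decode(encoded).decode("ascii")
```

The returned PEM text can then be fed to `openssl x509 -noout -enddate` to print the expiry date of the first certificate in the chain.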

Exposing Certificate Expiry

In order to ensure that our certificates are being renewed properly, we want to check the certificates that are being served up by the load balancers. To check the certificates we need to do the following:

1. Fetch a list of FQDNs to check from the appropriate API (GCP or GKE/Kubernetes)

2. Connect to each FQDN and retrieve the certificate

3. Check the Valid To field for the certificate to ensure it isn’t in the past
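As a rough illustration of connecting to an FQDN and checking the Valid To field, here is a sketch using only the Python standard library. It is not the code of the exporters described in this article; the function names `fetch_not_after` and `seconds_until_expiry` and the default port and timeout are assumptions made for the example.

```python
import socket
import ssl
import time


def fetch_not_after(fqdn, port=443, timeout=5.0):
    """Connect to fqdn over TLS and return the certificate's notAfter string.

    The default context verifies the certificate, so an expired or untrusted
    certificate raises ssl.SSLError during the handshake instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((fqdn, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=fqdn) as tls:
            return tls.getpeercert()["notAfter"]


def seconds_until_expiry(not_after, now=None):
    """Return seconds remaining before not_after; zero or negative means expired."""
    # cert_time_to_seconds parses strings such as "Jan  5 09:34:43 2018 GMT".
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return expiry - now
```

Calling `seconds_until_expiry(fetch_not_after("app0.prod.gcp0.example.com"))` would cover the last two steps for a single hostname; the list of hostnames still has to come from the GCP or Kubernetes API.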

To do the first two parts of this process we’ll use a couple of programs that I’ve written that scrape the GCP and K8S APIs and expose the expiry times for every certificate in each:

Kubernetes manifest for prometheus-gke-letsencrypt-certs :

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: prometheus-gke-letsencrypt-certs
      namespace: system-monitoring
      labels:
        k8s-app: prometheus-gke-letsencrypt-certs
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: prometheus-gke-letsencrypt-certs
      template:
        metadata:
          labels:
            k8s-app: prometheus-gke-letsencrypt-certs
          annotations:
            prometheus_io_port: '9292'
            prometheus_io_scrape_metricz: 'true'
        spec:
          containers:
            - name: prometheus-gke-letsencrypt-certs
              image: roobert/prometheus-gke-letsencrypt-certs:v0.0.4
              ports:
                - containerPort: 9292

Kubernetes manifest for prometheus-gcp-ssl-certs :

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: prometheus-gcp-ssl-certs
      namespace: system-monitoring
      labels:
        k8s-app: prometheus-gcp-ssl-certs
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: prometheus-gcp-ssl-certs
      template:
        metadata:
          labels:
            k8s-app: prometheus-gcp-ssl-certs
          annotations:
            prometheus_io_port: '9292'
            prometheus_io_scrape_metricz: 'true'
        spec:
          containers:
            - name: prometheus-gcp-ssl-certs
              image: roobert/prometheus-gcp-ssl-certs:v0.0.4
              ports:
                - containerPort: 9292

These exporters each connect to a different API and then expose a list of CNs with their Valid To value in seconds (as a Unix timestamp). Using these values we can calculate how long is left until the certificate expires ( $valid_to - time() ).
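To make the shape of such output concrete, here is a small sketch that renders a mapping of CN to Valid To timestamp in the Prometheus text exposition format. It is an illustration only and not the exporters' implementation: the function name `render_expiry_metrics`, the `cn` label name, and the HELP text are assumptions, and the metric name is simply passed in by the caller.

```python
def render_expiry_metrics(metric_name, expiries):
    """Render {cn: valid_to_unix_seconds} as Prometheus text exposition lines.

    Assumes CNs are plain hostnames, so no label-value escaping is performed.
    """
    lines = [
        f"# HELP {metric_name} Certificate Valid To time as a Unix timestamp.",
        f"# TYPE {metric_name} gauge",
    ]
    # Sort by CN so the output is stable between scrapes.
    for cn, valid_to in sorted(expiries.items()):
        lines.append(f'{metric_name}{{cn="{cn}"}} {valid_to}')
    return "\n".join(lines) + "\n"
```

Exposing the absolute timestamp rather than a precomputed "seconds remaining" value keeps the exporter stateless; the subtraction against `time()` happens at query time in Prometheus.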

Once these exporters have been deployed, and provided that Prometheus has been configured, like ours, to look for the prometheus_io_* annotations, Prometheus should start scraping these exporters and the metrics should be visible in the Prometheus UI. Search for gke_letsencrypt_cert_expiration or gcp_ssl_cert_expiration.

Visibility

Now that certificate metrics are being updated, the first useful thing we can do is make them visible.

Each of our projects has a Grafana instance automatically deployed to it and preloaded with some useful dashboards, one of which queries Prometheus for data about the SSL certs. When a certificate has less than seven days until it runs out, it turns orange; when it’s expired it will turn red.

The JSON for this dashboard can be found in this gist: gist:roobert/e114b4420f2be3988d61876f47cc35ae

Alerting

Next, let’s set up some Alertmanager alerts so we can surface issues rather than having to check for them ourselves:

    ALERT GKELetsEncryptCertExpiry
      IF gke_letsencrypt_cert_expiry - time() < 86400 AND gke_letsencrypt_cert_expiry - time() > 0
      LABELS { severity="warning" }
      ANNOTATIONS {
        SUMMARY = ": SSL cert expiry",
        DESCRIPTION = ": GKE LetsEncrypt cert expires in less than 1 day"
      }

    ALERT GKELetsEncryptCertExpired
      IF gke_letsencrypt_cert_expiry - time() <= 0
      LABELS { severity="critical" }
      ANNOTATIONS {
        SUMMARY = ": SSL cert expired",
        DESCRIPTION = ": GKE LetsEncrypt cert has expired"
      }

    ALERT GCPSSLCertExpiry
      IF gcp_ssl_cert_expiry - time() < 86400 AND gcp_ssl_cert_expiry - time() > 0
      LABELS { severity="warning" }
      ANNOTATIONS {
        SUMMARY = ": SSL cert expiry",
        DESCRIPTION = ": GCP SSL cert expires in less than 1 day"
      }

    ALERT GCPSSLCertExpired
      IF gcp_ssl_cert_expiry - time() <= 0
      LABELS { severity="critical" }
      ANNOTATIONS {
        SUMMARY = ": SSL cert expired",
        DESCRIPTION = ": GCP SSL cert has expired"
      }
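The thresholds in these rules can be restated as a small function, which is handy for sanity-checking a timestamp by hand. This is only a sketch of the same comparison logic; Prometheus evaluates the rules itself, and the names `alert_severity` and `WARNING_WINDOW_SECONDS` are made up for the example.

```python
import time

# One day, matching the 86400-second window used in the warning rules.
WARNING_WINDOW_SECONDS = 86400


def alert_severity(valid_to, now=None):
    """Return "critical", "warning", or None for a Valid To Unix timestamp."""
    if now is None:
        now = time.time()
    remaining = valid_to - now
    if remaining <= 0:
        # Certificate has already expired.
        return "critical"
    if remaining < WARNING_WINDOW_SECONDS:
        # Certificate expires in less than one day.
        return "warning"
    return None
```

For instance, a certificate whose Valid To is one hour away yields "warning", and one whose Valid To is exactly now yields "critical".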

Caution: The window of opportunity for receiving warnings before cert expiry is extremely slim because the LetsEncrypt controller only renews certificates within 1-2 days of expiry.

Conclusion

In this article, I’ve outlined our basic SSL monitoring strategy and included the code for two Prometheus exporters which can expose the metrics necessary to configure your own graphs and alerts. I hope this has been helpful.









[1] Technically TLS but commonly referred to as SSL