Let’s assume you operate a large bare metal cluster which you rent to your customers to run their workloads.

You notice that the cluster has in fact only been used at 60% on average over the last few months. You keep 15% of capacity free in case of surges, and you need 5% of the cluster to operate the infra itself.

Split of resource usage in the infrastructure

There are peaks at certain times of the day where it loads at 90%, but they are really hard to predict because they depend on your customers’ businesses and do not follow patterns of data you own.

Overall, you always run with about 20% of untapped capacity.

This may happen to you as a service provider or when you have a bunch of servers in your office. Either way, it's waste.

What if we could leverage that pool of resources and put it to good use? Let's see how! Yes Tesla Motors this is for you ;D

“Opportunistic AutoScaling”

Kubernetes and Autoscaling

Since v1.2 of K8s, you can use the Horizontal Pod Autoscaler to autoscale an app based on CPU consumption. Cool but very limited use case.

Since version 1.6, you may use custom metrics to autoscale an application within a cluster. Much better. It means you can expose metrics such as the number of hits on an API or its latency, then scale the serving pods according to that metric instead of the default CPU load.

Now doesn't that mean we could expose remaining capacity of the cluster (20%), and leverage it to autoscale a money printing application, so that the cluster is permanently used at its optimal capacity?

What you really want as a split

Opportunism

The above behaviour is what I call "Opportunistic Autoscaling": the ability for an app to leverage otherwise unused capacity of the infrastructure, be it CPU, memory or GPUs.

The business-critical app will be measured on how it performs. Your API must always answer with a low latency, for example. On the other hand, your non-critical app can only consume what’s left:

Resources available for autoscaling

In a nutshell, the "remaining capacity" behaves as:

If the “business critical" load on your cluster goes UP, the remaining resources go DOWN, and the opportunistic app shall scale DOWN.

If on the contrary your paid load goes DOWN, the remaining capacity goes UP, and the number of opportunist containers should go UP.

The target is to have a load that is as constant as possible around a threshold you define (80% in our example), thus collecting an average of 20% unused power and monetizing it.

Printing money

It may look like a silly application, but I can definitely tell you that mining pools seem to see a load increase when office hours finish, showing that (some) business resources are definitely being used for mining at night!

In a real life scenario, crypto mining is effectively adding resources to a compute grid, hence it does also make sense beyond the hype and fun. There are also a lot of other interesting use cases. Among others:

Lambda on the edges using a serverless framework (Telco / Cloud Operator)

Elastic transcoding (Media Lab / Cloud): Think of what Ikea is doing on workstations but in a compute cluster

AI on the edges (Media Lab / Cloud)

Caching (CDN)

The cool use case you’ll share in comments

OK, enough talking! Let's get this done and increase revenues.

DISCLAIMER: Deploying this is fairly complex and involves multiple steps. As a consequence this post is a lot longer and more technical than usual ones. If you are here because you like the use case but do not wish to dig into technical details, you can essentially skip from now to the conclusion.

Using Reversed Custom Metrics

In this blog, we will create a K8s cluster with a custom metrics API on bare metal.

We then create an app that exposes (among others) a metric as follows

Remaining CPU = (total amount of CPU available in the cluster) - (total CPU requested by all applications)

This metric decreases when the load of the cluster grows, and grows when the load shrinks. It effectively mirrors the requested load on the cluster.

Then we will use this metric to configure a Horizontal Pod Autoscaler (HPA) in Kubernetes. This will result in keeping the load in the cluster as high as possible.

Full disclosure

This blog post has been sponsored by Kontron, who graciously allowed me to play with a 6-node cluster of their latest SymKloud Platform, an awesome piece of hardware in which you can mix and match modules to create a cluster. There are modules with GPUs, CPUs, some dedicated to storage or caching… Each 2U server can contain up to 9 "sleds", effectively going up to 576 cores with dual CPU sleds, or up to 9 nVidia P4 with 288 cores. My cluster had 6 workers, 2 of them with GPUs.

In addition, my friend Ronan Delacroix helped me with the code and wrote all of the python needed for this experiment.

This work is currently presented at MWC in Barcelona on Kontron's booth.

Kubernetes Cluster

It happens that Kontron partners with Canonical to run Kubernetes on SymKloud. You will not be surprised to learn that this post is based on the Canonical Distribution of Kubernetes (CDK), in its version 1.8.

If you need to replicate this one way or another, you will need a K8s cluster in version 1.8, with RBAC active, and an admin role.

Important Note: The APIs we will be using here are very unstable and subject to big changes. I really recommend you read the K8s change log to check on them.

For example, there was a change in 1.8 on the name of the APIs. If you run a 1.7 cluster, this will impact you.

There are also changes in 1.9, and custom.metrics.k8s.io moves from v1alpha1 to v1beta1.

There are some details of the configuration we will see today that are done in a certain way on CDK and may be slightly different on clusters that are self hosted. I will try to mention them whenever possible. In any case, feel free to ask questions in the Q&A.

RBAC Configuration

NOTE: this applies to CDK, and will apply to GKE when the API Aggregation is GA with K8s 1.9.

On CDK and GKE the default user is not a real admin from an RBAC perspective, so you need to update it before you can create other Cluster Role Bindings that extend your own role.

First make sure you know your identity with

CDK: you are the user admin

GKE: your GCP user is the admin

Now grant yourself the cluster-admin role:

You can then check it out with:
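On CDK, where the default user is called admin, these two steps might look like the following (the binding name is an arbitrary choice, and the user name may differ on other deployments):

```shell
# Grant the admin user the cluster-admin ClusterRole:
kubectl create clusterrolebinding superpowers \
  --clusterrole=cluster-admin --user=admin

# Verify the binding was created as expected:
kubectl get clusterrolebinding superpowers -o yaml
```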

Helm Install

In order for Helm to operate in a RBAC cluster, we need to add it as a cluster-admin as well:
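A common way to do this is a dedicated service account for Tiller, bound to cluster-admin (these are the conventional names, adjust if your setup differs):

```shell
# Service account for Tiller in kube-system:
kubectl -n kube-system create serviceaccount tiller
# Bind it to cluster-admin:
kubectl create clusterrolebinding tiller \
  --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
# Deploy Tiller with that service account:
helm init --service-account tiller
```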

OK, we are done prepping our cluster.

Note: By default Helm deploys Tiller without resource requests. When deliberately overloading a cluster to maximize its usage, this means Tiller will be among the pods that may be evicted when resources are exhausted. If you do not want that to happen, you can edit the manifest and reapply it:
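A sketch of that edit (the resource values are an example, size them for your cluster):

```shell
# Export the Tiller manifest:
kubectl -n kube-system get deployment tiller-deploy -o yaml > tiller.yaml
# Edit tiller.yaml: add resources.requests (e.g. cpu: 100m, memory: 128Mi)
# to the tiller container spec, then reapply:
kubectl apply -f tiller.yaml
```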

AutoScaling: Preparing the cluster

Introduction & References

First of all, I recommend you have a look at the documentation about extending Kubernetes.

Then also look at the documentation about extending Kubernetes with the Aggregation Layer.

The last theoretical doc is about setting up an API Server, and you can find it here.

Once the aggregation is active, we will deploy 2 custom APIs: the Metric Server and a Custom Metric Adapter

OK now that you are versed in what we need to do, let’s get started.

Configuring the control plane

In order to activate the Aggregation Layer, we must add a few flags to our API Server:
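A sketch of the flag set (file paths are assumptions from a CDK layout; pass the flags however your distribution configures the API server):

```shell
# Aggregation Layer flags for kube-apiserver:
--requestheader-client-ca-file=/root/cdk/ca.crt
--requestheader-allowed-names=client
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--enable-aggregator-routing=true
--client-ca-file=/root/cdk/ca.crt
```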

You will note that we do not activate the proxy client certificate flags (--proxy-client-cert-file and --proxy-client-key-file).

This is because the proxy in CDK uses a Kubeconfig and not a client certificate.

However, we do enable the aggregator routing because the control plane of Kubernetes is not self hosted and we fall in the case “If you are not running kube-proxy on a host running the API server, then you must make sure that the system is enabled with the enable-aggregator-routing flag”.

Also we added the client-ca-file flag to export the CA of the API server in the cluster.

Now for the Controller Manager, we must tell it to use the HPA, which we do with:
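A sketch of those Controller Manager flags (the shortened delay values are demo choices, not recommendations):

```shell
# Make the HPA read metrics from the aggregated APIs, and shorten the
# scaling delays so demo results are quick to observe:
--horizontal-pod-autoscaler-use-rest-clients=true
--horizontal-pod-autoscaler-upscale-delay=1m
--horizontal-pod-autoscaler-downscale-delay=1m
```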

Note that the last 2 options here are really for demos, to make it quick to observe the results of actions. You may not need to change them for your use case (they default to 3m and 5m).

Just to make sure the settings are applied restart the 2 services with
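On CDK 1.8 the control plane components run as snap services, so the restart might look like this (service names may differ on your deployment):

```shell
sudo systemctl restart snap.kube-apiserver.daemon
sudo systemctl restart snap.kube-controller-manager.daemon
```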

This will make Kubernetes create a configmap in the kube-system namespace called extension-apiserver-authentication, which contains all the additional flags we generated and their configuration. You can have a look at it via
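To inspect it:

```shell
kubectl -n kube-system get configmap extension-apiserver-authentication -o yaml
```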

Each of the API servers will now need an RBAC authorization to read this ConfigMap. Thankfully, K8s will also automatically create a role for it:

Last but not least, you won’t need Heapster for now, so make sure it is not there via:
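A quick way to check and clean up (the deployment name is an assumption, verify it first with the get command):

```shell
# See whether Heapster is deployed:
kubectl -n kube-system get deployments | grep heapster
# If so, remove it:
kubectl -n kube-system delete deployment heapster --ignore-not-found
```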

Initial API State

Before you start having fun with API Servers, have a look at the status of your cluster with:
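```shell
kubectl api-versions
```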

At the end of the next sections, you will have 3 more APIs in this list:

monitoring.coreos.com/v1, for the Prometheus Operator

metrics.k8s.io, for the Metrics Server that collects metrics for CPU and Memory

custom.metrics.k8s.io, for the custom metrics you want to expose

Adding the Metrics Server API

There are 2 implementations of the Metrics API (metrics.k8s.io) at this stage: Heapster and the Metrics Server. At the time of this writing, the Metrics Server has a simple deployment method, while Heapster would have required some work on my end, and I was too lazy to write the code.

We can simply deploy it with
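The manifest path below is an assumption based on this post's repository; the upstream metrics-server repository ships an equivalent set of manifests in its deploy/ directory:

```shell
kubectl apply -f src/metrics-server/
```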

This manifest contains:

the Service Account for the Metrics Server

a RoleBinding so that the Metrics Server can read the configmap above

a ClusterRoleBinding so that the Metrics Server inherits the system:auth-delegator ClusterRole (you can find documentation about that here)

a Deployment and ClusterIP Service for the Metrics Server

an APIService object, which is a registration of the new API into the API Server.

Now check our APIs again:
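```shell
kubectl api-versions | grep metrics.k8s.io
```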

Awesome… But does it really work? Query the endpoint of the API via kubectl to make sure

Good job, NodeMetrics and PodMetrics are exposed. Look into what you can use from there:

And

No big surprise here, you can access the CPU and memory consumption in real time. Refer to the docs for more details about how to query the API.
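Collected in one place, the three checks above look like this (jq is optional, it just pretty-prints the JSON):

```shell
# Root of the new API:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1" | jq .
# NodeMetrics: per-node CPU and memory consumption:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
# PodMetrics: per-pod CPU and memory consumption:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq .
```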

Installing the Custom Metrics Pipeline

So far we have taken a shortcut: we had a metrics pipeline that is directly exposable as an aggregated API. Unfortunately, in the case of custom metrics, we must do this in 2 distinct steps.

First of all we must deploy the custom metrics pipeline, which will give us the ability to collect metrics. We use Prometheus for that part as the canonical example of metrics collection system on K8s.

Then we will expose these metrics via a specific API Server. We will use the work of Sully (@DirectXMan12) that can be found here for that.

Prometheus has many installation methods. My personal favorite is the Prometheus Operator. It takes a lot of effort to architect a piece of software using traditional solutions. But crafting a software model that ties into the underlying distributed infrastructure beautifully is closer to art than to anything else.

That is essentially what the operator is. The operator models how Prometheus should behave given a set of conditions, then realizes that in Kubernetes. Wow, good job @CoreOS.

Note that you can create an Operator for anything, and that something similar is coming for Tensorflow as far as I can see the APIs coming up… Anyway, let’s not get distracted.

Install the Prometheus Operator with:
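The manifest name below is an assumption; adapt it to wherever you keep the Operator deployment manifest:

```shell
kubectl apply -f src/prometheus-operator.yaml
```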

This contains:

a Service Account for the operator

a ClusterRole and ClusterRoleBinding that are fairly extensive, so that the Operator can deploy Custom Resource Definitions for Prometheus (instances of), Alert Managers and Service Monitors.

a Deployment for the Operator pod.

This will let the Operator add the monitoring API as well:
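```shell
kubectl api-versions | grep monitoring.coreos.com
```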

Now create an instance of Prometheus with:
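The file names are assumptions, matching the two manifests described just below:

```shell
kubectl apply -f src/prometheus-rbac.yaml
kubectl apply -f src/prometheus.yaml
```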

The RBAC manifest will allow Prometheus to read the metrics it needs in the cluster and /metrics endpoints of any object (pod or service). The Prometheus manifest defines an instance and a service to expose it as a nodePort (so we can have a look at the UI).

What is important in this second file is the section:

This essentially dedicates the Prometheus instance to Service Monitors with this label (or set of labels). When we define the applications we want to monitor and how, we will need that information.
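A sketch of that section of the Prometheus manifest (the label key and value here are hypothetical; they must match the labels you set on your ServiceMonitors):

```yaml
spec:
  serviceMonitorSelector:
    matchLabels:
      team: k8s-autoscaling   # hypothetical label, must match your ServiceMonitors
```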

Note that this is a trivial example of deployment, with no persistent storage or any fancy thingy. If you are contemplating using this for a more production grade usage, you will need to spend some time on this.

OK, now you can connect on the UI and check that you have everything deployed correctly. It is pretty empty for now…

Installing the Custom Metrics Adapter

Now that we have the ability to collect metrics via our Prometheus pipeline, we want to expose them under the aggregated API.

First of all, you will need some certificates. Joy. This is all documented here. Run the following commands to generate your precious:
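A sketch of that generation (file names and certificate subjects are assumptions, adapt them to the chart's expectations):

```shell
# Generate a throwaway CA:
openssl req -x509 -newkey rsa:2048 -keyout ca.key -out ca.crt \
  -days 365 -nodes -subj "/CN=custom-metrics-ca"
# Generate a serving key and certificate request for the adapter:
openssl req -newkey rsa:2048 -keyout serving.key -out serving.csr \
  -nodes -subj "/CN=custom-metrics-apiserver"
# Sign it with the CA:
openssl x509 -req -in serving.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out serving.crt -days 365
# Base64-encode the material for use in the chart values:
base64 -w0 serving.crt > serving-base64.crt
base64 -w0 serving.key > serving-base64.key
```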

In order to authenticate our extended API server against the Kubernetes API Server, we have several options:

Using a client certificate

Using a Kubeconfig file

Using BasicAuth or Token authentication

Adding users with certificates in CDK is a project in itself and would deserve its own blog post. If interested, ping me in the questions and we can discuss this in DMs. BasicAuth and Tokens are easy, but they also require editing /root/cdk/known_tokens.csv or /root/cdk/basic_auth.csv on all masters and restarting the API server daemon everywhere.

So the solution with the least complexity is actually the Kubeconfig file. Thanks to RBAC, the only thing we need to create a new user is a service account, which will give us access to an authentication token, which we can then put into our kubeconfig.

You can then create a copy of your .kube/config file and edit the user section to add the custom-api-server:

Do not forget to also edit the contexts to map to this user instead of admin.
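The whole sequence might look like this (the service account name and cluster name are assumptions):

```shell
# Create a service account to act as our new "user":
kubectl -n kube-system create serviceaccount custom-api-server
# Fetch its authentication token:
SECRET=$(kubectl -n kube-system get sa custom-api-server \
  -o jsonpath='{.secrets[0].name}')
TOKEN=$(kubectl -n kube-system get secret "${SECRET}" \
  -o jsonpath='{.data.token}' | base64 -d)
# Add the user and a matching context to your kubeconfig copy:
kubectl config set-credentials custom-api-server --token="${TOKEN}"
kubectl config set-context custom-api-server \
  --cluster=<your-cluster> --user=custom-api-server
```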

Now edit a cm-values.yaml file for the helm chart:
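A sketch of what cm-values.yaml could contain (the key names are assumptions, check the chart's own values.yaml; the certificate material comes from the openssl step above):

```yaml
prometheus:
  url: http://prometheus.default.svc   # where your Prometheus instance answers
  port: 9090
tls:
  enable: true
  ca: |-
    <contents of ca.crt>
  key: |-
    <contents of serving.key>
  certificate: |-
    <contents of serving.crt>
```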

OK you are now ready to download the chart and install it
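The chart source below is a placeholder; point it at wherever the custom metrics adapter chart lives:

```shell
helm fetch <repo>/custom-metrics-apiserver
helm install --name custom-metrics --namespace monitoring \
  -f cm-values.yaml ./custom-metrics-apiserver
```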

And we check that the new API is registered in Kubernetes:

Great. Now let us check that everything works properly by querying the K8s endpoint for it:
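```shell
kubectl api-versions | grep custom.metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1alpha1" | jq .
```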

At this point, if we get a 200 answer and NOT a 404, we are good: our custom metrics API is up and running. If you get a 404, something did not work properly.

Summary

In the long section above, we have done the following

Installed the new Metrics Server, adding the metrics.k8s.io API to the cluster. This gave us access to an equivalent of Heapster, but exposing metrics under the classic API.

Installed a custom metrics pipeline to be able to collect any metric. We did this via Prometheus, using the Operator to create a Prometheus instance.

Installed the Custom Metrics API custom.metrics.k8s.io via the installation of a Prometheus Adapter.

For each step, we validated that the cluster worked properly and as intended. Now we need to put it to good use.

Using Custom Metrics

Demo Application: http_requests

First of all we will test our setup with a very simple application written by @luxas that exposes an http_requests metric on /metrics. You can deploy it with
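The manifest name is an assumption; it bundles the objects described just below:

```shell
kubectl apply -f src/sample-metrics-app.yaml
```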

This manifest contains:

the Deployment and Service so we can query the application

a Service Monitor, which will indicate to the Prometheus instance that it should scrape the metrics of the application

A Horizontal Pod Autoscaler (HPA), which will consume the number of http_requests and use it to scale the application.

Let us look into the HPA for a moment:
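A sketch of that HPA, using the autoscaling/v2beta1 API of K8s 1.8 (object names and replica bounds are assumptions, the 500m target matches the description below):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: sample-metrics-app-hpa
spec:
  scaleTargetRef:            # the Deployment to scale
    apiVersion: apps/v1beta1
    kind: Deployment
    name: sample-metrics-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods               # a per-pod custom metric
    pods:
      metricName: http_requests
      targetAverageValue: 500m
```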

As you can see, we have here

a Target (our deployment), with a minReplicas and a maxReplicas.

a metric of type Pods, which tries to make sure that pods receive an average of 500m queries (which is slightly above the standard load generated by Kubernetes + Prometheus)

So this means that the application does not need to rely on its own metrics. You could potentially target any application's metrics and use them to manage another application. A very powerful principle.

Let us say for example that you manage an application based on the principle of decoupled invocation, such as a chat or an order management solution. Some day, you start getting a peak of requests on the front end, and the backend does not follow. The queue fills up, and you start experiencing delays in the processing of requests. Well, now you can scale the workers that process the queue based on the requests made on the front end. You create a target object that monitors the number of http_requests on the frontend, but the scale target may be your backend application. It is as simple as that.

Now look at how Custom API reacts to this (it may take a couple of minutes before this works)

And we can then query the service itself:

And we then look at our HPA:
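Gathered in one place, these three checks might look like this (service address and port are assumptions):

```shell
# Query the custom metrics API for the pods' http_requests metric:
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1alpha1/namespaces/default/pods/*/http_requests" | jq .
# Query the service itself:
curl -s http://<node-ip>:<node-port>/
# And look at the HPA:
kubectl get hpa
```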

We can see here that just the status requests account for 433m per pod, which is roughly one request every 2.3 seconds. Now deploy a shell app so we can create some load:

Now prepare 2 shells. In the first one, connect into your container with

And in the second one, track the HPA with
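The two shells might look like this (pod and service names are assumptions):

```shell
# Shell 1: open a session in the demo container...
kubectl exec -it shell-demo -- /bin/bash
# ...then, inside the container, generate some load:
while true; do curl -s http://sample-metrics-app >/dev/null; sleep 0.5; done

# Shell 2: watch the autoscaler react
kubectl get hpa -w
```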

And with that, we have successfully triggered a scale-up and a scale-down of an application.

Main Application: Optimizing the infrastructure

OK! Now we have an application that can generate load in our cluster and consume resources. Now let us look at how to use the remaining resources.

So first of all let’s look at our metrics harvesting application Ronan wrote. It exposes the following metrics:
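Ronan's actual code is in the repository linked at the end of this post; the core idea behind the cpu_capacity_remaining gauge can be sketched in a few lines of plain Python (function names are illustrative, and the numbers are hard-coded here where the real app queries the Kubernetes API):

```python
def cpu_capacity_remaining(allocatable_cores, requested_cores):
    """Remaining CPU = total allocatable CPU in the cluster
    minus the CPU requested by all applications."""
    return allocatable_cores - requested_cores

def prometheus_line(value):
    # Prometheus text exposition format for a gauge sample
    return "cpu_capacity_remaining %s" % value

# A 184-core cluster with 116 cores requested leaves 68 cores remaining:
print(prometheus_line(cpu_capacity_remaining(184, 116)))
# -> cpu_capacity_remaining 68
```

Prometheus scrapes this value, and the custom metrics adapter then exposes it under custom.metrics.k8s.io, where an HPA can consume it.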

In addition, he created a nice UI that presents the values in real time.

This requires a Grafana installation. You can install both apps with:
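The manifest names are assumptions, matching the contents described just below:

```shell
kubectl apply -f src/monitoring-app.yaml
kubectl apply -f src/grafana.yaml
```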

These manifests contain

Cluster Roles and bindings for collecting metrics

Deployments for both Grafana and the python application

Nodeports services on ports 30505 (app) and 30902 (Grafana)

Config maps to configure both

What is of interest to us in this example is the “cpu_capacity_remaining”. As mentioned in the intro, I had access, thanks to Kontron, to a 184-core cluster. I decided to “reserve” 30 cores, about 16% of my capacity, to give room for load peaks. This gave me an autoscaler looking like:
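A sketch of that autoscaler (object names and replica bounds are assumptions): an Object metric targeting the cpu_capacity_remaining gauge exposed by the monitoring app, aiming at 30 free cores.

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: etn-miner
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: etn-miner
  minReplicas: 1
  maxReplicas: 40
  metrics:
  - type: Object             # a cluster-level metric, not a per-pod one
    object:
      target:
        kind: Service
        name: capacity-exporter   # hypothetical service exposing the metric
      metricName: cpu_capacity_remaining
      targetValue: "30"           # keep ~30 cores free for load peaks
```

Because the metric shrinks as business load grows, the HPA scales the miner down when paying customers need the cores, and back up when they do not.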

You will note I am using Electroneum as my crypto. The reason for this is practical. It is a very new cryptocurrency, with limited mining resources allocated to it right now, which means you can directly measure your impact and see daily returns, which is cool for demos. In case you wonder: as this currency uses a Monero miner, this setup can easily be converted into something more lucrative by pointing it at a real Monero pool.

To replicate this blog with your own machines, edit the src/manifest-etn.yaml file according to your own cluster then deploy with :
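```shell
kubectl apply -f src/manifest-etn.yaml
```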

This manifest contains:

a Deployment of the miner

a Horizontal Pod Autoscaler as seen above

a service to expose the UI on port 30500 of nodes.

Now let us check on our HPAs with:
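```shell
kubectl get hpa
```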

Alright, we are all set! Now we can finally check how our application reacts to load.

Opportunistic Autoscaler in motion

In order to supercharge our cluster, we reuse our shell-demo application and generate 10 hits per second on the API for 5 minutes. Because we are expecting only 0.5 hits per second, this will quickly trigger the scale-out:
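From inside the shell-demo container, the surge might look like this (the service name is an assumption; 3000 hits at one every 0.1s is roughly 10 hits/s for 5 minutes):

```shell
for i in $(seq 1 3000); do
  curl -s http://sample-metrics-app >/dev/null
  sleep 0.1
done
```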

There you go: we can see the new pods coming in. Each new pod requests 4 CPU cores from the cluster. This unbalances the miner's HPA, which tries to compensate by releasing miners. Over 5 minutes, our app scales up to 17 replicas, claiming 68 cores from the cluster, which are freed by the mining app. After 5 minutes, the load is back to normal and we see the simple app scale down from 17 pods to its stable state of 2 replicas. The HPA for the miner then reacts and starts harvesting the capacity again.

This can be seen in the UI on the CPU capacity graph

We now have an application that is opportunistically adjusting to the load created by other applications in the cluster all by itself. To see a little better how the HPA behaves, we can look directly at Grafana:

Cluster perfectly auto adjusting to business load

Here you can clearly identify the peaks of load on the second graph in green, how the HPA reacts by scaling the number of API replicas. On the top graph, we can see the blue area (business load) going up, and shortly after the yellow line going down (this is the opportunistic app scaling down), the red "remaining CPU cores" fluctuating, while the total (yellow + blue + red) is about constant, representing the total number of cores in the system (184).

I should share a potato as this post was really, really long.

Some thoughts about the HPA

Keep the non business load low

While creating this blog, I had a very hard time configuring the HPA to make it stable and convergent rather than completely erratic. One must understand that the HPA in K8s is, so far, pretty dumb. It does not learn from the past; it systematically repeats the same reaction patterns regardless of whether they failed or succeeded before.

Let’s say a custom metric is at 150% of its target value; the cluster will then perform a 150% capacity increase. This means that if a 1% change in scale moves your metric by 2%, you will enter a turbulence zone, with the HPA incapable of converging because it is always overreacting to its environment.

Because of that behaviour, if the opportunistic load represents the majority of your total cluster load, you have a risk of generating an ever fluctuating, sub-optimal HPA. Below is an example where the mining rig varies between 15 and 120 cores (60% of the cluster), while the business load is only ~20%. Under these conditions, the cluster takes too long to converge and, effectively, sometimes never does so.

Cluster in a non convergent state

Long story short: NEVER, EVER use an HPA that can diverge!! Experiment and learn, keep the influence of the HPA reasonable in the cluster.

Total Failure

So this is something I was not able to debug completely.

In the last Grafana screen above, you can see that there is a longer peak of high load in the second load surge. In effect, the HPA got stuck and for some reason would never scale back down until forced to.

From my experience, this only happens when an HPA fluctuates greatly then reaches its max. At this point, if that condition lasts for too long, it will then fail to downscale afterwards, thus effectively crashing.

Again, when building an HPA, do some experiments. Test your metrics, make sure they work well together.

Conclusion

I always dreamt of building the “Opportunistic Autoscaler”. For the first time in my life, thanks to Ronan, Kontron, and the awesome work done by the community on Kubernetes and Canonical on CDK, I was able to put it together. And it "just works"!

At the beginning of the post, we wanted to add value by either reducing costs or increasing revenues. Depending on your opportunistic app, you may be in either or both of these cases.

For sure, over the course of the month this was up and running, we managed to

average ~25 free CPU cores, against a target of 24 in the HPA, while creating random load every 30 min.

Opportunistically consume an average of 30 cores, which would otherwise have been lost.

Does that make money with mining cryptos? Not much. We mined about 1000 ETN while testing this, for a total value of about $100. More than nothing but not a lot.

But now think Serverless and do the math. 30 cores is about 1 sled of the system free at any time. Assuming this also translates into the same amount of RAM being free:

A sled can have up to 256GB RAM.

For Lambda, AWS charges $0.00001667/GB-s, plus a bit for the invocations

There are 86,400 × 365.25 = 31,557,600 s/yr

soooo… 256 × $0.00001667 × 31,557,600 = $134,672.69

$134,672.69 is, if it were permanently running Lambda at 100% all the time, the business value of the very sled we just used. Not bad for an "unused" resource.
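As a sanity check, the back-of-the-envelope arithmetic can be reproduced in a few lines:

```python
# Theoretical yearly Lambda value of one permanently free 256GB sled:
gb = 256
price_per_gb_second = 0.00001667  # AWS Lambda price per GB-second
seconds_per_year = 86400 * 365.25  # 31,557,600 seconds

value = gb * price_per_gb_second * seconds_per_year
print(round(value, 2))  # -> 134672.69
```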

Does that give you some ideas?

References

I would like to shout a special thanks to @Luxas and @DirectXMan12 for inspiring this work and for the fantastic walkthroughs they wrote here and there, which helped me a lot while writing this.

The code for the UI is here: https://github.com/ronhanson/crypto-miner-webui

The source and manifests for K8s are here: https://github.com/madeden/blogposts/tree/master/k8s-autoscaling