Imagine shopping in a store with no price tags, where all you can see is the total bill after you go through checkout. At some point, you might realize that you’ve spent too much and decide to spend less next time. You try buying other brands and smaller packages, but the final bill is just not getting much smaller.

This is what it’s like to manage Kubernetes clusters with no cost accounting. You keep adding and removing applications, optimizing various parameters, sometimes adding new tenants to the cluster, and still the surprise bills keep hitting you at the end of the month. It is time to cut back, but where? To match your spending to what you actually need, you have to put price tags on all the items in your cluster. The process of assigning price tags to pods, deployments, services, and namespaces in a cluster is called cost accounting.

Back in the era of one-to-one deployments, a single application would run on a single server, and the cost of that server was effectively the cost of that application. Everything was simple. With the introduction of containers, we are leaving that era behind, and that’s a good thing. Running multiple pods and containers per server makes a lot of sense from an economic standpoint: it shrinks the server fleet footprint and thus lowers the cost of running an application. Kubernetes orchestrates those containers into larger logical blocks and hides much of the low-level complexity of managing such an environment. It all sounds great, but you still have to manage it at a higher level, and now cost accounting is even more critical. Why? Well, here are four reasons.

Understand what you’re buying

Before you or your financial department get surprised by the next bill from Amazon or Google, you need to understand the actual costs of cluster components. Or, put more simply: what did you spend the money on?

The spending can be viewed from two perspectives: that of the resources consumed and that of the business function of a specific component. While the costs for each class of resources are usually given to you by the cloud provider, splitting those costs between components is something you will need to work out yourself. In future posts I plan to describe some simple mechanisms you can use for doing that. So let’s define the categories. The charges for consumed resources can be split into three categories:

compute, which consists of the charges for the nodes

storage, such as costs of the Persistent Disks or EBS volumes

networking, which will include external traffic charges, load balancing, cross AZ charges and so on

Logical components of Kubernetes clusters can be divided into:

admin, which includes etcd, API server, scheduler and everything you can find in the kube-system namespace

monitoring, which includes the metrics stack, e.g. Prometheus

logging, e.g. ELK

applications — all the pods which handle the user workload

idle — resources, like CPU and memory, which were provisioned but not allocated or used

Once you summarize the answers in a chart (for example, using this template), many things about the cluster become apparent right away. You might find, for instance, that a large chunk of money was spent on idle resources, or that the cost of the storage for logging is on par with the cost of the storage for the user pods themselves. Perhaps running fewer nodes and trimming the logs a bit might reduce the monthly bill by a few thousand dollars.
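To make the two-perspective breakdown concrete, here is a minimal Python sketch of such a cost matrix: resource categories on one axis, logical components on the other. All the dollar figures are made up for illustration; in practice they would come from your cloud bill and your own attribution logic.

```python
# Hypothetical monthly costs (USD) per logical component, broken down
# into the three resource categories. The numbers are illustrative only.
costs = {
    #               compute, storage, networking
    "admin":        (400,     50,      20),
    "monitoring":   (600,    900,      80),
    "logging":      (500,   1100,     150),
    "applications": (4200,  1200,     700),
    "idle":         (5000,   300,       0),
}

total = sum(sum(row) for row in costs.values())
for component, (compute, storage, networking) in costs.items():
    subtotal = compute + storage + networking
    print(f"{component:12s} ${subtotal:>6,} ({subtotal / total:.0%} of total)")
```

Even with fake numbers, a printout like this makes the conclusions above jump out: the idle row dwarfs the admin stack, and logging storage rivals application storage.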

Produce realistic budgets

For many years, I’ve watched people successfully applying the “coin toss”, also known as the “guesstimate” method, for estimating the budget of their clusters and applications. Even though this is a time-tested method which is well used across the industry, it works only in one scenario — when no-one really cares about the budget. One legitimate example of this is organizations where the infrastructure cost is negligible compared to the rest of the expenses. In other scenarios, this method has a number of shortcomings, i.e., it sucks! Here’s why.

There are two kinds of “guesstimate” budgets. Let’s call the first one the “bloated guesstimate.” In this budget, costs are intentionally inflated by some safety margin every year, to make sure you stay on the safe side and don’t run over budget. This works until the moment you realize that the money reserved for that margin and never used might have been better spent on something more valuable, like bonuses.

The other kind of budget is the “deflated guesstimate” budget. This is when the budgeted amount is lower than it should be. This makes you look good during annual budget planning meetings. When that happens, make sure you enjoy that moment, as for the rest of the year you will be trying to squeeze your actual costs into that budget.

In either case, “guesstimates” won’t make you look good in front of the rest of the organization in the long run.

Having realistic budgets for a cluster and its applications helps with directing the organization’s money where it is needed and makes everyone’s lives easier.

Discourage overspending by teams

Depending on the type of organization you work for, there could be multiple teams using the clusters. Developers often face tight deadlines and need to deliver software in short time frames. Under pressure like that, everyone usually focuses on the stability of that software rather than on optimizing the amount of resources it uses. That’s understandable; business objectives always come first. But still, in situations like that, it is essential to understand the actual cost of such a scramble. The only way you can do that is by having fair and accurate cost accounting in place. By fair, I mean that every team sees the actual cost of running their services. Quite often, I’ve come across cases where the cost per team is calculated using guesses, while ensuring that the totals always add up to the total cost of the cluster. These guesses are often derived from previous guesses, gut feelings, and a bit of randomization.
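One simple, fair alternative to guessing is to split the cluster’s compute cost among teams in proportion to the resources their pods request. Here is a minimal sketch of that idea; the pod data and the cost figure are hypothetical, and a real implementation would read resource requests from the Kubernetes API and weight memory alongside CPU.

```python
# Split a hypothetical monthly compute cost between teams in proportion
# to the CPU their pods request. Pod data here is made up; in practice
# you would pull requests from the Kubernetes API.
monthly_compute_cost = 10_000  # USD, assumed figure

# (team, cpu_request_in_cores) per pod
pods = [
    ("payments", 2.0),
    ("payments", 2.0),
    ("search",   4.0),
    ("frontend", 1.0),
    ("frontend", 1.0),
]

requested_by_team: dict[str, float] = {}
for team, cpu in pods:
    requested_by_team[team] = requested_by_team.get(team, 0.0) + cpu

total_requested = sum(requested_by_team.values())
for team, cpu in requested_by_team.items():
    share = cpu / total_requested
    print(f"{team:10s} {share:.0%} -> ${monthly_compute_cost * share:,.0f}")
```

Note that this sketch spreads the whole bill across teams; a fairer variant would charge teams only for what they request and book the unrequested capacity under “idle”, so over-provisioning stays visible.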

In any case, using guesswork to estimate the cost of applications is likely to make resource-hogging applications look a lot better than they are in reality. Arbitrary division of the cluster costs forces the other tenants to absorb the cost of a new application hastily developed without regard for efficiency. As a result, the actual cost of rushing an app to production is hidden. It might not seem like a big deal at first, but as time progresses, it can turn into a full-blown tragedy of the commons. Without seeing the actual monetary costs, teams are encouraged to keep sacrificing the efficiency of their applications for velocity. Once such an attitude starts propagating through the rest of the company, be prepared for a rapid increase in the clusters’ cost.

Once you start reporting the real costs per application, there is a high chance that some people will start scratching their heads, as the cost of some shortcuts they’ve taken turns out to be a lot greater than anyone expected. Or it might be the other way around: the inefficiencies you have been worried about might not matter that much after all.

Fix only what’s worth fixing

Know the cost of each component and spend your time wisely; optimize only what is worth optimizing.

It is easy to get excited about improving the efficiency of the things you know you can improve. But is it worth it? Is this what you should really focus on right now? Well, it is hard to say. You need to do a cost-benefit analysis first. And it is impossible to analyze any improvement of the costs of the clusters unless you see a clear picture of the cost of each of their components.

If you know the cost of the component, you can gauge the potential benefit of improving the cost of that component and can easily compare it to the cost of the required work.

For example, let’s say you’ve read a great article about storage optimization in the new version of your metrics stack and cannot wait to roll it out. Apart from that, there is no other reason to switch to a newer version. You’ve estimated that the upgrade will take 4 days of work and would halve the storage used by the metrics stack. Let’s also say that your time costs the company about $100/hour. So the cost of the upgrade is 4 days × 8 hours/day × $100 = $3,200, while the potential savings are $100/month, or $1,200/year. Clearly, in the presence of other tasks, this is not the best investment. At the same time, the cost of the idle resources is $5,000/month, and the fix is as simple as shutting down a few extra nodes, which won’t take even an hour.
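The arithmetic above generalizes into a quick break-even check you can apply to any optimization candidate. A small sketch, using the figures from the example:

```python
def payback_months(effort_hours: float, hourly_rate: float,
                   monthly_savings: float) -> float:
    """Months until an optimization's savings cover its labor cost."""
    return (effort_hours * hourly_rate) / monthly_savings

# Metrics-stack storage upgrade: 4 days of work at $100/hour, saving $100/month.
print(payback_months(4 * 8, 100, 100))   # 32.0 -> almost three years to break even

# Shutting down idle nodes: roughly an hour of work, saving $5,000/month.
print(payback_months(1, 100, 5000))      # 0.02 -> pays for itself immediately
```

A payback period of a month or two is usually an easy yes; one measured in years means the work only makes sense if it brings benefits beyond the savings.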

Now it is pretty clear which tasks you might want to focus on first, isn’t it?

What’s next?

In the following post, I intend to share spreadsheets which you can use for basic cluster cost accounting in your organization. A spreadsheet like that would be sufficient for smaller organizations.

For larger setups, I would suggest checking out untab.io, which I am working on right now. Untab can perform cost accounting for Kubernetes on a cluster/namespace/service/deployment/pod level, attribute costs to teams, and gauge the efficiency of the resource allocation and use.