To remain valuable and relevant to their lines of business (LOBs), in-house IT organizations have to deliver private cloud services that are competitive with public cloud services. Correspondingly, the community has been developing private cloud solutions such as OpenStack, which mimic public cloud services by offering virtual machines and other resources. However, corporations are realizing that they could never build an AWS-like cloud in-house. In 2018, Walmart, the biggest cheerleader of OpenStack, announced a five-year deal to make Microsoft Azure its preferred cloud provider. Although Walmart won't leave OpenStack completely in the dust, the deal does raise questions about Walmart's future use of OpenStack.

The idea of private cloud may not be dead yet, but it has definitely gone through near-death experiences. There are plenty of articles declaring that private cloud is dead. The analysis ranges from technical difficulties, operational challenges, and the slowness of enterprise IT processes to the economics of public cloud. However, it is very difficult for a large enterprise to go all-in on public cloud, and the value of private cloud is still undeniable. So there are new solutions such as AzureStack that deploy an on-premises public cloud. Hipsters also claim that containers and Kubernetes will rule the hyperscale data centers (check out my previous post on why Kubernetes is not ready for that).

Unfortunately, all these efforts are not sufficient unless we understand the fundamental problems with corporate data centers. To find out why, let’s start with this video, an interview with actors Gene Hackman and Dustin Hoffman.

If you cannot access YouTube for whatever reason, here is the story. Hoffman asked to borrow money from Hackman when the two men were broke young actors living in Los Angeles. Hackman went to his friend's apartment and saw on a shelf several jars labeled with various household expenses: rent, electricity, etc. They were all stuffed with cash, except the empty one labeled "food," which Hoffman wanted money to fill.

This interesting phenomenon is called mental accounting, a tendency people have to separate their money into different accounts based on miscellaneous subjective criteria, including the source of the money and the intended use for each account.

Richard Thaler, the 2017 Nobel laureate in economics, has been studying mental accounting intensively. Thaler points out that mental accounting violates the economist’s basic assumption that money is fungible. While many people (probably all of us including smart economists) use mental accounting in some way, we may not realize that this line of thinking often results in an irrational and detrimental set of behaviors. For example, some people keep a special “money jar” for a vacation or a new home while at the same time carrying substantial credit card debt.

Mental accounting happens in corporate data centers too. When you walk into the data center of a Fortune 500 company, you will find thousands of servers in rows of racks. They all look similar, right? Indeed, most servers run Linux on Intel processors. Just like money, general-purpose computers are fungible by design: if you can run a program on one computer, you should be able to run it on any other computer with the same hardware and software specs.

In reality, however, you cannot run your application on another server, because each server is "labeled" based on its budget source or intended use (HR, Finance, FICC, Risk, Hadoop, Cassandra, etc.). When you need more computing power, you have to go through a painful procurement process (six months if you are lucky) while many servers in other silos sit idle most of the time. Because machines are allocated into silos, overall utilization is terribly low, yet each silo may still run short of resources at peak times. It takes billions of dollars to build a modern data center and hundreds of millions of dollars to operate it. Low utilization simply means that we burn hundreds of millions of dollars every year, and people rarely notice.
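A back-of-the-envelope calculation makes the waste concrete. The figures below (operating cost, utilization levels) are illustrative assumptions, not measurements from any real data center:

```python
# Back-of-the-envelope estimate of the cost of low utilization. All figures
# below are illustrative assumptions, not data from any real data center.

def wasted_opex(annual_opex: float, utilization: float, target: float) -> float:
    """Operating spend tied up in capacity idle below the target utilization."""
    return annual_opex * (1 - utilization / target)

# Assume a $100M/year operating cost, 15% average utilization across the silos,
# and a 60% utilization that a pooled scheduler could plausibly sustain.
waste = wasted_opex(100e6, 0.15, 0.60)
print(f"~${waste / 1e6:.0f}M per year goes to capacity a shared pool could reclaim")
```

Under these assumed numbers, three quarters of the operating budget pays for capacity that a shared pool could reclaim.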

Why are enterprise data centers operated this way? Budgets and service-level agreements (SLAs). Budgets serve as a crude way to keep costs under control while giving employees discretion to spend as they see fit. IT departments typically charge LOBs back monthly based on cluster size, because doing so makes budget planning predictable and simplifies the internal payment process. Meanwhile, LOBs care about the SLA first and foremost. During the budget planning season, they fight hard to secure enough funding to own clusters large enough to meet the SLA at peak time. To guarantee the SLA, they also rarely want to share clusters with other LOBs, to avoid any potential interference. As a result, huge data centers are carved into hundreds of clusters (silos) with little resource sharing. Moreover, enterprise applications are rarely busy 24 x 7; in fact, many internal applications are batch jobs that run on a schedule. Therefore, overall utilization stays very low, yet data centers grow bigger and bigger to accommodate more applications. Although budgets and SLAs exist for sensible, understandable reasons, they lead to silly outcomes: IT infrastructure and costs grow out of control.
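The gap between sizing every silo for its own peak and sizing one shared pool for the combined peak is statistical multiplexing at work. The toy model below uses synthetic workload shapes, purely for illustration:

```python
import random

# Toy model of statistical multiplexing: compare the capacity needed when every
# LOB sizes its own silo for its peak against one shared pool sized for the
# combined peak. The workload shapes below are synthetic assumptions.

random.seed(42)
HOURS = 24 * 7          # one week, hourly samples
N_SILOS = 20            # assumed number of LOB clusters

# Each silo: a modest steady baseline plus rare, large bursts.
demand = [[random.uniform(5, 10) + (random.random() < 0.05) * random.uniform(40, 60)
           for _ in range(HOURS)]
          for _ in range(N_SILOS)]

siloed = sum(max(series) for series in demand)                           # sum of peaks
pooled = max(sum(series[h] for series in demand) for h in range(HOURS))  # peak of sum

print(f"siloed capacity: {siloed:.0f} units, pooled capacity: {pooled:.0f} units")
print(f"a shared pool needs {pooled / siloed:.0%} of the siloed capacity")
```

Because the bursts of different LOBs rarely coincide, the peak of the sum is far smaller than the sum of the peaks, so the shared pool needs a fraction of the siloed capacity.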

Now it is clear why most private cloud projects fail. OpenStack or AzureStack is introduced for additional technical capabilities (e.g., self-service VMs, managed databases, object storage), not to solve the aforementioned fundamental problems. In fact, they make the situation even a little worse, as they are deployed as yet another cluster operating on a dedicated budget.

Now that we have found the root causes of the inefficiency of corporate data centers, we have an opportunity to get private cloud right. But first we need to figure out what exactly cloud computing is. Both Amazon AWS and Microsoft Azure agree that cloud computing is the delivery of computing services (VMs, storage, etc.) over the Internet ("the cloud"). But what they try to sell is neither "the cloud" nor computing power. What they really sell is:

Elasticity — The ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible. Remember that EC2 stands for Elastic Compute Cloud.

Agility — Elasticity requires that vast amounts of computing resources (VMs) can be provisioned in minutes. It could be down to seconds or milliseconds if the resource orchestration units are lighter than VMs. Agility is the competitive advantage of the digital age.

Pay-as-you-go pricing — Without usage-based charging, elasticity makes no sense at all. Just like mobile plans, all-you-can-eat monthly plans expect people to use less, while usage-based plans encourage people to use smartly.
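Elasticity and pay-as-you-go pricing combine into a simple contract: provision capacity to track demand each interval, and bill only for what was provisioned. The sketch below illustrates the idea; the demand trace, 20% headroom, and $0.05/unit-hour price are made-up assumptions:

```python
# Minimal sketch of the elasticity + pay-as-you-go contract. The demand trace,
# 20% headroom, and $0.05/unit-hour price are illustrative assumptions.

def provision(demand: int, headroom: float = 1.2) -> int:
    """Capacity to run for the next interval: current demand plus headroom."""
    return int(demand * headroom) + 1

demand_trace = [10, 10, 80, 80, 15, 15, 5]   # units needed in each hour
PRICE_PER_UNIT_HOUR = 0.05

elastic_bill = sum(provision(d) * PRICE_PER_UNIT_HOUR for d in demand_trace)
static_bill = max(demand_trace) * PRICE_PER_UNIT_HOUR * len(demand_trace)

print(f"pay-as-you-go: ${elastic_bill:.2f} vs. static peak-sized cluster: ${static_bill:.2f}")
```

A cluster statically sized for the peak pays for 80 units every hour; the elastic consumer pays only for the capacity tracking its actual demand.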

All other selling points of cloud computing are secondary for giant enterprises. For example:

Cost — With cloud, startups can avoid the heavy upfront cost of buying servers. But enterprises avoid the upfront cost by leasing the servers in their data centers. Besides, they have strong bargaining power in procurement negotiations.

Scale — The data centers of Fortune 100 companies are no smaller than those of public clouds, and global enterprises have a global data center footprint too.

Productivity — By leveraging cloud, startups can focus on developing their core applications and minimize their infrastructure teams. Enterprise IT departments, however, already have strong teams of skilled technologists. The slowness of enterprise IT processes is often due to regulatory and compliance requirements rather than a lack of skills. Moving to the cloud can hardly improve productivity without reengineering the IT processes, and regulatory and compliance requirements cannot be relieved anyway.

Security — Security is a must-have, on-premises or in the cloud. No excuse.

The business goal of private cloud is to introduce elasticity, agility, and a usage-based payment model. Equipped with this analysis, it is clear that enterprises cannot achieve these goals by deploying cloud technologies without a change of mindset. To build a successful private cloud, enterprises should follow the principles below:

Manage a data center as a whole rather than as 300 individual clusters. Public clouds, as service providers, naturally manage their data centers as sets of fungible computers. Enterprise IT departments are service providers too, but their SLAs are defined at the application/cluster level with each corresponding LOB. To operate as private cloud providers, IT departments have to change their mindset and define the SLA at the data center level for the whole firm.

Build a usage-based chargeback system. Elasticity is meaningless without usage-based pricing. The cost effectiveness of cloud comes from the behavior changes nudged by the service-fee model. Even if a cloud could provide unlimited computing power, no one has an unlimited budget. To reduce cost, we have to change the way we consume computing power, driven by usage-based service fees.
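Mechanically, a chargeback system boils down to metering per-application consumption and billing each LOB for what it actually used. The sketch below illustrates the shape of such a system; the rate, LOB names, and usage records are made-up examples:

```python
from collections import defaultdict

# Sketch of a usage-based chargeback: meter per-application consumption and
# bill each LOB for what it actually used. The rate, LOB names, and usage
# records below are illustrative assumptions, not real figures.

RATE_PER_CPU_HOUR = 0.04   # assumed internal rate

usage_records = [            # (LOB, application, cpu-hours metered this month)
    ("HR",      "payroll-batch",   1_200),
    ("Finance", "risk-report",     8_500),
    ("Finance", "ledger-service",  3_000),
    ("FICC",    "pricing-engine", 22_000),
]

invoices = defaultdict(float)
for lob, _app, cpu_hours in usage_records:
    invoices[lob] += cpu_hours * RATE_PER_CPU_HOUR

for lob, amount in sorted(invoices.items()):
    print(f"{lob:8s} ${amount:,.2f}")
```

The hard parts in practice are accurate metering and a rate card that reflects the true cost of capacity, but the billing step itself is this simple.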

Develop a data center operating system (DCOS). To achieve elasticity and agility, we have to develop an efficient, scalable, and fault-tolerant operating system that manages data center resources globally. It is aware of the workload and resource demand of each application; tracks the available resources; dynamically allocates and deallocates resources to and from applications in real time; efficiently manages priority and preemption to meet SLAs; and centralizes logging for debugging, troubleshooting, and reporting. It should be able to orchestrate diverse workloads such as HPC, data analytics, batch jobs, web services, and long-running services. Although it is very challenging for the DCOS to meet the different latency and throughput requirements of various workloads, we can achieve it with the rich research and development results of recent years.
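The priority-and-preemption behavior at the heart of such a system can be sketched in a few lines. Real schedulers (e.g., Borg, Mesos, YARN) are far more sophisticated; the task names, priorities, and 100-CPU capacity below are assumptions for illustration only:

```python
import heapq

# Toy sketch of the DCOS allocation step: admit tasks into one shared pool and
# preempt strictly lower-priority work when the pool is full. Task names,
# priorities, and the 100-CPU capacity are illustrative assumptions.

CAPACITY = 100  # total CPU units in the pool

def submit(running, used, prio, name, cpus):
    """Admit (prio, name, cpus), evicting lower-priority tasks if necessary."""
    while used + cpus > CAPACITY and running and running[0][0] < prio:
        _p, _evicted, c = heapq.heappop(running)  # min-heap: lowest priority on top
        used -= c
    if used + cpus <= CAPACITY:                   # admit only if there is room
        heapq.heappush(running, (prio, name, cpus))
        used += cpus
    return running, used

running, used = [], 0
for prio, name, cpus in [(1, "nightly-etl", 30),    # low-priority batch jobs
                         (1, "ml-training", 60),
                         (9, "web-frontend", 50)]:  # high-priority service spike
    running, used = submit(running, used, prio, name, cpus)

print(sorted(n for _p, n, _c in running))
```

When the high-priority service spikes, the scheduler evicts just enough low-priority batch work to make room, which is exactly why batch jobs and online services can safely share one pool.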

Application-oriented infrastructure. Every company is a software company today. We win competitions with applications, not machines. Containerization enables data center operations teams to transform from being machine-oriented to being application-oriented. The shift will dramatically improve application deployment and introspection and help enterprises compete at a fast pace.

It is really hard to move legacy mission-critical applications to public cloud. Wait for a slow death, or fight disruptive innovation now with elastic and agile private clouds? Enterprise decision makers have to ask themselves this question.

This is also an enormous opportunity for solution providers and consulting firms. Fortune 100 companies typically operate three or more data centers each, and the market for private clouds is bigger than that of all public clouds combined. It is waiting for a mindset-shifting solution.