By Zeng Fansong, senior technical expert for the Alibaba Cloud Container Platform, and Chen Jun, systems technology expert at Ant Financial.

This article will take a look at some of the problems and challenges that Alibaba and its ecosystem partner Ant Financial had to overcome for Kubernetes to function properly at mass scale, and will cover the solutions proposed to the various problems the Alibaba engineers encountered. Some of these solutions include improvements to the underlying architecture of the Kubernetes deployment, such as enhancements to the performance and stability of etcd, the kube-apiserver, and kube-controller. These were all crucial for Alibaba to ensure the support needed for the 2019 Tmall 618 Shopping Festival to take full advantage of the 10,000-node Kubernetes cluster deployment. They are also important lessons for any enterprise interested in following Alibaba’s footsteps.

Starting from the very first AI platform developed by Alibaba in 2013, Alibaba’s managing cluster has gone through several stages of evolution before it went completely Kubernetes in 2018. In this article, we’re going to specifically take a look at some of the challenges of applying Kubernetes on such a mass scale, and some of the optimizations and fixes necessary to make everything work properly.

The production environment of Alibaba consists of more than 10,000 containerized applications, with the entire network of Alibaba being a massive system using millions of containers running on more than 100,000 hosts. The core e-commerce business of Alibaba runs on more than a dozen clusters, the largest of which includes tens of thousands of nodes. Many challenges had to be solved when Alibaba decided to go all in on Kubernetes, implementing it throughout the whole of their system. One of the biggest challenges, of course, was how to apply Kubernetes in an ultra-large scale production environment.

As the saying goes, Rome was not built in one day. To understand the performance bottleneck of Kubernetes, the team of engineers at Alibaba Cloud needed to estimate the scale of a cluster with 10,000 nodes based on the production cluster configuration, which was as follows:

200,000 pods

1,000,000 objects

For more details, consider this following graphic:

The team built a large cluster simulation platform based on Kubemark to simulate a kubelet with 10,000 nodes based on 200 4-core containers by starting 50 Kubemark processes through one container. When running common loads in the simulated cluster, the team encountered extremely high latency, up to 10s in fact, for some basic operations such as pod scheduling. Moreover, the cluster was also unstable.

To summarize the information covered in the graphic above, what the team noticed was that, with Kubernetes clusters with 10,000 nodes, several system components experienced the following performance problems:

etcd encountered read and write latency issues, which lead to a denial of service, with a large number of Kubernetes-stored objects unable to receive support due to space limitations. The API server encountered extremely high latency when querying pods and nodes, and the back-end etcd sometimes also ran out of memory due to concurrent query requests. The controller encountered high processing latency because it couldn’t detect the latest changes from the API server. For abnormal restarts, service recovery would often take several minutes. The scheduler encountered high latency and low throughput, and was unable to meet the routine O&M requirements of Alibaba to support the extreme performance nessary for the shopping festivals run on Alibaba’s various e-commerce platforms.

Improvements to etcd

To solve these problems, the team of engineers at Alibaba Cloud and Ant Finanical made comprehensive improvements to Alibaba Cloud Container Platform so to improve the performance of Kubernetes in these large-scale applications.

The first system component they improved was etcd, which is the database that stores objects for Kubernetes, having an important role in a Kubernetes cluster.

In the first version, the total storage capacity of etcd was improved by dumping the data stored in etcd to a tair cluster. However, this had the disadvantage that the security of the Kubernetes cluster could be hampered by the complex O&M requirements of the tair cluster, which uses a data consistency model that is not based on a Raft replication group.

In the second version, different types of objects on the API server were stored in different etcd clusters. Then, different data directories were created within etcd, and data in different directories was routed to different etcd back ends. This helped to reduce the total data volume stored in a single etcd cluster and improve overall extensibility.

In the third version, the team studied the internal implementation principle of etcd and found that the extensibility of etcd was mainly determined by the page allocation algorithm of the underlying bbolt db. As etcd stores increased the amount of data, bbolt db experienced worse performance in linearly searching for the storage page with a run length.

To solve this problem, the team of engineers at Alibaba designed an idle page management algorithm based on segregated hashmap, which uses the continuous page size as its key and the start page ID of continuous pages as its value. Searching for O(1) idle pages could be done by querying segregated hashmap, with improved performance. That is, when a block is released, the new algorithm tries to merge pages with contiguous addresses and to update segregated hashmap. For more information about this algorithm, check out this CNCF blog.

This new algorithm helped increase the etcd storage space from the recommended 2 GB to 100 GB. This in turn greatly improved the data storage capacity of etcd and avoided a significant increase in read and write latency. The Alibaba team of engineers cooperated with Google engineers to develop the etcd Raft learner similar to ZooKeeper Observers), which is equipt with fully concurrent read, along with a load of other features to improve data security and read and write performance. These improvements were contributed to open source projects and will be released in community etcd 3.4.

Improvements to the API Server

In this following section, we will look at some of the challenges that the engineering team faced when it came to optimizing the API Server and look at the solutions they proposed. For this section, we will need to go into quite a few technical details, as the engineering team addressed several issues.

Efficient Node Heartbeats

The scalability of Kubernetes clusters depends on the effective processing of node heartbeats. In a typical production environment, the kubelet reports a heartbeat every 10s. The content connected with each heartbeat request reaches 15 KB, including dozens of images on the node and a certain amount of volume information. However, there are two problems with all of this:

The heartbeat request triggers the update of node objects in etcd, which may generate nearly 1 GB of transaction logs per minute in a Kubernetes cluster with 10,000 nodes. The change history is recorded in etcd. In a node-packed Kubernetes cluster that involves high CPU usage of the API server and huge overhead of serialization and deserialization, the CPU overhead for processing heartbeat requests exceeds 80% of the CPU time usage of the API server.

To solve these problems, Kubernetes introduced the built-in Lease API to strip heartbeat-related information from node objects. Consider the above graphic for more details. The kubelet no longer updates node objects every 10s, and it makes the following changes:

It updates Lease objects every 10s to indicate the survival status of nodes. The node controller determines whether nodes are alive based on the Lease object status. In consideration of compatibility, the cluster updates node objects every 60s so that EvictionController works based on the original logic.

The cost of updating Lease objects is much lower than that of updating node objects because Lease objects are extremely small. As a result, the built-in Lease API greatly reduces the CPU overhead of the API server and the transaction logs in etcd, and scales up a Kubernetes cluster from 1,000 nodes to a few thousand nodes. By default, the built-in Lease API is enabled in Kubernetes 1.14, see KEP-0009 for more details.

Load Balancing for the API Server

In a production cluster, multiple nodes are typically deployed to form high-availability (HA) Kubernetes clusters for performance and availability considerations. During the runtime of an high-availability cluster, the loads among multiple API servers may be unbalanced, especially in the occasion that the cluster is upgraded or nodes restart due to faults. This imbalance creates a great burden for the cluster’s stability. In extreme situations, the burden on the API servers that was originally distributed through high-availability is concentrated again on one node. This slows down the system response and makes the node unresponsive, causing an avalanche.

The following figure shows stress testing on a simulated triple-node cluster. After the API servers are upgraded, the entire burden is concentrated on one API server, whose CPU overhead is far higher than the other two API servers.

A common way to balance loads is to add a load balancer. The main load in a cluster comes from the processing of node heartbeats. Therefore, a load balancer can be added between the API server and kubelet, in two ways:

Add a load balancer on the API server side and connect all Kubernetes clusters to the load balancer. This method is applied to Kubernetes clusters delivered by cloud vendors. Add a load balancer on the kubelet side and enable the load balancer to select API servers.

After performing some stress testing, the team of engineers found that adding a load balancer was not an effective method for balancing loads. They realized that they needed to first understand the internal communication mechanism of Kubernetes before they could move any further. After much deep research into Kubernetes, the team found that the Kubernetes clients have made many attempts to reuse the same TLS connection to control the overhead of TLS connection authentication. In most cases, client watchers work over the same TLS connection at the lower layer. Reconnection is triggered only when this TLS connection is abnormal, which leads to API server failover. In normal cases, load switching does not occur after the kubelet is connected to an API server. To implement load balancing, the team made optimizations to the following aspects:

They configured the API servers to consider clients as untrusted to prevent overloading due to excessive requests. Then, in the case an API server exceeds a load threshold, it sends 429 error code (to indicate too many requests received) to notify the clients of backoff. When the API server exceeds a higher load threshold, it closes client connections to deny requests. They configured clients to periodically reconnect to other API servers when frequently receiving the 429 error code within a specific period of time. At the O&M level, the team upgraded the API servers by setting maxSurge to 3 to prevent performance jitter during upgrades.

As shown by the monitor screenshot in the lower-left corner of the above figure, the loads among all the API servers are almost perfectly balanced after these optimizations, and the API servers quickly returned to the balanced state when two nodes restart (see the jitter).

List-Watch and Cacher

List-Watch is essential for server-client communication in Kubernetes. API servers watch the data changes and updates of objects in etcd through a reflector and store the changes and updates in the memory. Controllers and Kubernetes clients subscribe to data changes through a mechanism similar to List-Watch.

List-Watch adds the globally incremental resourceVersion of Kubernetes to prevent data loss during server-client reconnection in case of communication interruption. As shown in the following figure, the reflector stores the synchronized data version and notifies the API server of the current version (5) during reconnection. The API server calculates the starting position (7) of the client-required data based on the last change record in the memory.

This process looks simple and reliable but is not without problems.

The API server stores each type of object in a storage object. Storage objects are classified as follows:

Pod storage Node storage Configmap storage

Each type of storage provides a limited queue that stores the latest object changes to compensate for the lagging of watchers in scenarios such as retries. Generally, all types of storage share a space with an incremental version from 1 to n, as shown in the above figure. The version of pod storage is incremental but not necessarily continuous. A client may pay attention only to some pods when synchronizing data through List-Watch. In a typical scenario, the kubelet pays attention only to the pods that are related to its own node. As shown in the preceding figure, the kubelet pays attention only to Pods 2 and 5, marked in green.

The storage queue is limited due to its first in first out (FIFO) policy. When the queue of pods is updated, earlier updates are eliminated from the queue. As shown in the above figure, when the updates in the queue are unrelated to the client, the client retains the rv value as 5. In the case that the client reconnects to the API server after the rv value 5 is eliminated, the API server cannot determine whether there is a change that the client needs to perceive between the rv value 5 and the minimum value 7 that is held by the queue. Therefore, the API server returns Client too old version error to request the client to list all data. To solve this problem, Kubernetes introduces the Watch bookmark feature.

Watch bookmark maintains a heartbeat between the client and API server to enable prompt updates of the internal reflector version even when the queue has no changes for the client to perceive. As shown in the above figure, the API server pushes the latest rv value 12 to the client at the proper time so that the client version can keep up with the version updates on the API server. Watch bookmark reduces the number of events that need to be resynchronized during API server restart to 3% of the original event quantity, with performance improved by dozens of times. Watch bookmark was developed on Alibaba Cloud Container Platform and released in Kubernetes 1.15.

Cacher and Indexing

In addition to List-Watch, clients can access API servers through a direct query, as shown in the above figure. API servers take care of the query requests of clients by reading data from etcd to ensure that consistent data is retrieved by the same client from multiple API servers. However, with this system, the following performance problems occur:

Indexing is not supported. The pod for node query must first obtain all the pods in the cluster, which results in a huge overhead. If the amount of data queried by a single request is large, too much memory is consumed due to the request-response model adopted by etcd. The amount of requested data is limited by the queries that API servers make to etcd. Queries of massive data are completed through pagination, which results in multiple round trips that greatly reduce performance. To ensure consistency, API servers use Quorum read to query etcd. The query overhead is measured at the cluster level and cannot be expanded.

The team of engineers at Alibaba designed a data collaboration mechanism that works inbetween API servers and etcd to ensure that consistent data is retrieved by the same client from multiple API servers through their cache. The below figure shows the workflow of the data collaboration mechanism.

The client queries the API servers at time t0. The API servers request the current data version rv@t0 from etcd. The API servers update their request progress and wait until the data version of the reflector reaches rv@t0. A response is sent to the client through the cache.

The consistency model of the client is maintained in the data collaboration mechanism — you can try out if you are interested. The cache-based request-response model allows for flexible enhancement of the query capability by means of namespace nodename and labels. This enhancement greatly boosts the read request processing capability of API servers, reduces the describe-node time from 5s to 0.3s (with node name indexing triggered) in a 10,000-node cluster, and makes query operations such as Get Nodes more efficient.

Context-Aware Problems and Optimizations

To understand what are content-aware problems, first consider the following situation. The API Server needs to receive and complete requests to have access to external services. This could be, for example, to access to the etcd for persistent data, or the Webhook Server for extended Admission or Auth, or even the API Server itself.

Now the problem with all of this comes down to the way in which the API Server processes requests. For example, in the case that the client has terminated a request, the API Server would still continue to process the request regardless. This, of course, results in a blacklog of Goruntine and resources. Next, it could easily be the case that the client may try to intiate a new request, while the API Server is still likely to continue to request the initial request.This again would cause the API Server to run out of memory and crash.

But why is this “Content-aware”? Well, here “context” refers to the resources in the request by the client. Generally speaking, after the client’s request ends, the API Service should also reclaim the resources of the request and stop the request. This last part of this is important piece because, if the request isn’t stopped, the throughput of the API server will be dragged down by the backlog of Goruntine and related resources, causing the issues we discussed above.

The Engineers at Alibaba and Ant Financial have participated in the optimization of context-aware issues and features for the entire request process connected to the API Server. Kubernetes version 1.16 has optimized the Admission, Webhook components to improve the overall performance and throughput of the API Server.

Request Flood Prevention

The API Server is also vulnerable to receiving too many request. This is because, even though the API Server comes with the max-inflight filter, which limits maximum concurrent read/wwrites, there really aren’t any other filters for restricting the number of requests to the Server, making it vulnerable to crashes.

The API Server is an internal deployment, and therefore it doesn’t receive external requests, with requests coming from the internal components and modules of Kubernetes. Based on the observations and experience of the team of engineers at Alibaba, such API Server crashes are mainly caused by either one of the following two scenarios:

Issues during API Server restarts or updates

The API Server is at the core of Kubernetes, so, whenever the API server needs to be restarted or updated, all of Kubernetes’s components must be reconnected to it, which means that each and every component will send requests to the API Server. Among processes, reestablishing the List-Watch consumes a lot of resources. Both the API Server and etcd have a cache layer, so in the case the cache layer of the API Server cannot receive an incoming client request, the request can be directed to etcd. However, often it’s the case that the resources of etcd list may end up filling all of the network cache of both the API Server and etcd, causing the API Server to run out of memory and crash. Then, it’s also likely the client will continue to send even more requests, leading to an avalanche.

To resolve the problem described, the team at Alibaba used the “Active Reject” function. With this enabled, the API Server will reject larger incoming requests that the cache cannot handle. A rejection returns a 429 error code to the client, which will cause the client to wait some time before sending a new request. Generally speaking, this can optimize the IO between the API server and client, reducing the chances of the API server crashing.

The Client has a bug that causes an abnormal number of requests

One of the most typical bugs is a DaemonSet component bug, which causes the number of request to be multipled by the number of nodes, which then in turn ends up crashing the API Server.

To resolve this problem, you can use a method of emergency traffic control implemented with User-Agent to dynamically limit requests based on their source. With this enabled, you can easily restrict requests of the problem component that you identify as having the bug based on the data collected in the monitoring chart. By stopping its requests, you can thus project the API Server, and with the server protected, you can procede to fix the component with the bug. If the Server crashes, you wouldn’t be able to fix the bug. The reason you want to control traffic with User-Agent, instead of Identity, for this is because an Identity request can be consume a lot of resources, which could cause the server to crash.

The engineers at Alibaba and Ant Financial have provided these method to the community as a user stories.

Controller Failover

In a 10,000-node production cluster, one controller stores nearly one million objects. It consumes substantial overhead to retrieve these objects from the API server and deserialize them, and it may take several minutes to restart the controller for recovery. This is unacceptable for enterprises as large as Alibaba. To mitigate the impact on system availability during component upgrade, the team of engineers at Alibaba developed a solution to minimize the duration of system interruption due to controller upgrade. The following figure shows how this solution works.

Controller informers are pre-started to pre-load the data required by the controller. When the active controller is upgraded, Leader Lease is actively released to enable the standby controller to immediately take over services from the active controller.

This solution reduces the controller interruption time to several seconds (less than 2s during upgrade). Even in the case of abnormal downtime, the standby controller can resynchronize data after Leader Lease expires, which takes 15s by default rather than a couple of minutes. This solution greatly reduces the mean time to repair (MTTR) of controllers and mitigates the performance impact on API servers during controller recovery. This solution is also applicable to schedulers.

Customized Schedulers

Due to historical reasons, Alibaba schedulers adopt a self-developed architecture. Enhancements made to schedulers are not described in detail here due to length constraints. Rather, only two basic ideas — both nonetheless important ideas — are described here, which are also shown in the figure below.

Equivalence classes: A typical capacity expansion request is intended to scale up multiple containers at a time. Therefore, the team divided the requests in a pending queue into equivalence classes to batch-process these requests. This greatly reduced the number of predicates and priorities. Relaxed randomization: When a cluster provides many candidate nodes for serving one scheduling request, you only need to select enough nodes to process the request, without having to evaluate all nodes in the cluster. This method can improve the scheduling performance at the expense of solving accuracy.

Conclusion

Through a series of enhancements and optimizations, the engineering teams at Alibaba and Ant Financial were able to successfully apply Kubernetes to the production environments of Alibaba and reach the ultra-large scale of 10,000 nodes in a single cluster. To sum things up, specific improvements that were made include the following:

The storage capacity of etcd was improved by separating indexes and data and sharding data. The performance of etcd in storing massive data was greatly improved by optimizing the block allocation algorithm of the bbolt db storage engine at the bottom layer of etcd. Last, the system architecture was simplified by building large-scale Kubernetes clusters based on a single etcd cluster. The List performance bottleneck was removed to enable the stability of operation of 10,000-node Kubernetes clusters through a series of enhancements, such as Kubernetes lightweight heartbeats, improved load balancing among API servers in high-availability (HA) clusters, List-Watch bookmark, indexing, and caching. Hot standby mode greatly reduced the service interruption time during active/standby failover on controllers and schedulers. This also improved cluster availability. The performance of Alibaba-developed schedulers was effectively optimized through equivalence classes and relaxed randomization.

Based on these enhancements, Alibaba Cloud and Ant Financial now run their core services in 10,000-node Kubernetes clusters, which have withstood the immense test of the 2019 Tmall 618 Shopping Festival.

Original Source