This is part four of a four-part series on scaling game servers with Kubernetes.

In the previous three posts, we hosted our game servers on Kubernetes, measured and limited their resource usage, and scaled up the nodes in our cluster based on that usage. Now we need to tackle the harder problem: scaling down the nodes in our cluster as resources are no longer being used, while ensuring that in-progress games are not interrupted when a node is deleted.

On the surface, scaling down nodes in our cluster may seem particularly complicated. Each game server has in-memory state of the current game and multiple game clients are connected to an individual game server playing a game. Deleting arbitrary nodes could potentially disconnect active players — and that tends to make them angry! Therefore, we can only remove nodes from a cluster when a node is empty of dedicated game servers.

This means that if you are running on Google Kubernetes Engine (GKE), or similar, you can’t use a managed autoscaling system. To quote the documentation for the GKE autoscaler: “Cluster autoscaler assumes that all replicated Pods can be restarted on some other node…” In our case that is definitely not going to work, since it could easily delete nodes that have active players on them.

That being said, when we look at this situation more closely, we discover that we can break it down into three separate strategies that, when combined, make scaling down a manageable problem that we can implement ourselves:

1. Group game servers together to avoid fragmentation across the cluster
2. Cordon nodes when CPU capacity is above the configured buffer
3. Delete a cordoned node from the cluster once all the games on the node have exited

Let’s look at each of these in detail.

Grouping Game Servers Together in the Cluster

We want to avoid fragmentation of game servers across the cluster, so that we don’t end up with a small, wayward set of game servers still running across multiple nodes. That would prevent those nodes from being shut down and their resources from being reclaimed.

This means we don’t want a scheduling pattern that creates game server Pods on random nodes across our cluster. Instead, we want our game server Pods packed as tightly as possible onto as few nodes as possible.

To group our game servers together, we can take advantage of the Kubernetes PodAffinity configuration with the preferredDuringSchedulingIgnoredDuringExecution option. This lets us tell Pods that we prefer to group them by the hostname of the node they are on, which essentially means that Kubernetes will prefer to put a dedicated game server Pod on a node that already has a dedicated game server Pod on it.

In an ideal world, we would want a dedicated game server Pod to be scheduled on the node with the most dedicated game server Pods, as long as that node also has enough spare CPU resources. We could definitely do this if we wanted to write our own custom scheduler for Kubernetes, but to keep this demo simple, we will stick with the PodAffinity solution. That being said, when we consider the short length of our games, and that we will be adding (and explaining) cordoning nodes shortly, this combination of techniques is good enough for our requirements, and removes the need for us to write additional complex code.
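To make the "ideal world" rule above concrete, here is a minimal sketch of the selection it describes: pick the node with the most game server Pods that still has enough spare CPU. This is not a Kubernetes scheduler and not code from this series; the `node` type, its fields, and `pickNode` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// node is a hypothetical summary of one cluster node, not a Kubernetes type.
type node struct {
	name      string
	gamePods  int   // game server Pods currently running on the node
	freeMilli int64 // CPU not yet requested, in millicores
}

// pickNode returns the name of the node with the most game server Pods that
// still has room for a Pod requesting reqMilli of CPU. ok is false when no
// node has enough spare CPU.
func pickNode(nodes []node, reqMilli int64) (name string, ok bool) {
	best := -1
	for i, n := range nodes {
		if n.freeMilli < reqMilli {
			continue // not enough spare CPU on this node
		}
		if best == -1 || n.gamePods > nodes[best].gamePods {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return nodes[best].name, true
}

func main() {
	nodes := []node{
		{name: "node-a", gamePods: 9, freeMilli: 50}, // fullest, but no room left
		{name: "node-b", gamePods: 6, freeMilli: 400},
		{name: "node-c", gamePods: 1, freeMilli: 900},
	}
	name, ok := pickNode(nodes, 100)
	fmt.Println(name, ok) // node-b true
}
```

A real custom scheduler would have to watch Pods and nodes through the Kubernetes API and bind Pods itself, which is the extra complexity the PodAffinity approach lets us avoid.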

When we add the PodAffinity configuration to the previous post’s configuration, we end up with the following, which tells Kubernetes to put Pods with the label sessions: game on the same node as each other whenever possible.

pod.yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: "game-"
      labels:
        sessions: game # assumed here so the labelSelector below matches these Pods
    spec:
      hostNetwork: true
      restartPolicy: Never
      nodeSelector:
        role: game-server
      containers:
        - name: soccer-server
          image: gcr.io/soccer/soccer-server:0.1
          env:
            - name: SESSION_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            limits:
              cpu: "0.1"
      affinity:
        podAffinity: # group game server Pods
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1 # a weight from 1 to 100 is required; this value is illustrative
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    sessions: game
                topologyKey: kubernetes.io/hostname

Cordoning Nodes

Now that we have our game servers relatively well packed together in the cluster, we can discuss “cordoning nodes”. What does cordoning nodes really mean? Very simply, Kubernetes gives us the ability to tell the scheduler: “Hey scheduler, don’t schedule anything new on this node here”. This ensures that no new Pods get scheduled on that node. In fact, in some places in the Kubernetes documentation, this is simply referred to as marking a node unschedulable.
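Under the hood, marking a node unschedulable means setting the `spec.unschedulable` field on the Node object, which is also what `kubectl cordon` does. As a rough sketch, the body of a JSON merge patch that flips that field could be built like this. The `cordonPatch` helper is a hypothetical name, and this is not necessarily how the scaler in this series implements its own cordon call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cordonPatch builds a JSON merge patch body that sets spec.unschedulable
// on a Node. Passing false produces the body for uncordoning.
func cordonPatch(unschedulable bool) ([]byte, error) {
	patch := map[string]map[string]bool{
		"spec": {"unschedulable": unschedulable},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := cordonPatch(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"unschedulable":true}}
}
```

Pods that are already running on the node are not affected by this flag, which is exactly what we need: in-progress games keep running while no new ones arrive.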

In the code below, if you focus on the s.bufferCount < available branch, you will see that we make a request to cordon nodes if the amount of CPU buffer we currently have is greater than the buffer we have configured. We’ve stripped some parts out for brevity, but you can see the original here.

scaler.go

    // scaleNodes scales nodes up and down, depending on CPU constraints.
    // This includes adding nodes, cordoning them, as well as deleting them.
    func (s Server) scaleNodes() error {
        nl, err := s.newNodeList()
        if err != nil {
            return err
        }

        available := nl.cpuRequestsAvailable()
        if available < s.bufferCount {
            finished, err := s.uncordonNodes(nl, s.bufferCount-available)
            // short circuit if uncordoning means we have enough buffer now
            if err != nil || finished {
                return err
            }

            // refresh the node list (plain assignment, so nl is not shadowed)
            nl, err = s.newNodeList()
            if err != nil {
                return err
            }
            // recalculate
            available = nl.cpuRequestsAvailable()
            err = s.increaseNodes(nl, s.bufferCount-available)
            if err != nil {
                return err
            }
        } else if s.bufferCount < available {
            err := s.cordonNodes(nl, available-s.bufferCount)
            if err != nil {
                return err
            }
        }

        return s.deleteCordonedNodes()
    }

As you can also see from the code above, we can uncordon any available cordoned nodes in the cluster if we drop below the configured CPU buffer. This is faster than adding a whole new node, so it’s important to check for cordoned nodes before adding a new node from scratch. Because of this, we also have a configured delay before a cordoned node is deleted (you can see the source here), to limit unnecessary thrashing between creating and deleting nodes in the cluster.

This is a pretty great start. However, when we cordon nodes, we want to cordon only the nodes with the fewest game server Pods on them, since those nodes are the most likely to empty first as game sessions come to an end.

Thanks to the Kubernetes API, it’s relatively straightforward to count the number of game server Pods on each Node, and sort them in ascending order. From there we can do arithmetic to determine if we still remain above the desired CPU buffer if we cordon each of the available nodes. If so, we can safely cordon those nodes.

scaler.go

    // cordonNodes decreases the number of available nodes by the given number of cpu blocks (but not over),
    // cordoning those nodes that have the least number of games currently on them
    func (s Server) cordonNodes(nl *nodeList, gameNumber int64) error {
        // … removed some input validation ...

        // how many nodes (n) do we have to cordon such that we are cordoning no more
        // than the gameNumber
        capacity := nl.nodes.Items[0].Status.Capacity[v1.ResourceCPU] // assuming all nodes are the same
        cpuRequest := gameNumber * s.cpuRequest
        diff := int64(math.Floor(float64(cpuRequest) / float64(capacity.MilliValue())))

        if diff <= 0 {
            log.Print("[Info][CordonNodes] No nodes to be cordoned.")
            return nil
        }

        log.Printf("[Info][CordonNodes] Cordoning %v nodes", diff)

        // sort the nodes, such that the ones with the least number of games are first
        nodes := nl.nodes.Items
        sort.Slice(nodes, func(i, j int) bool {
            return len(nl.nodePods(nodes[i]).Items) < len(nl.nodePods(nodes[j]).Items)
        })

        // grab the first n number of them
        cNodes := nodes[0:diff]

        // cordon them all; index into the slice rather than taking the
        // address of the loop variable
        for i := range cNodes {
            n := &cNodes[i]
            log.Printf("[Info][CordonNodes] Cordoning node: %v", n.Name)
            if err := s.cordon(n, true); err != nil {
                return err
            }
        }

        return nil
    }
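To make the arithmetic in cordonNodes concrete, here is a small standalone sketch of the same calculation with plain integers. The function name `nodesToCordon` and the numbers in the example (100 millicores per game, 1000 millicores per node) are assumptions for illustration, not values taken from the project.

```go
package main

import "fmt"

// nodesToCordon returns how many whole nodes can be cordoned while giving up
// no more CPU than surplusGames game servers' worth of requests.
// All CPU values are in millicores.
func nodesToCordon(surplusGames, cpuRequestMilli, nodeCapacityMilli int64) int64 {
	if nodeCapacityMilli <= 0 {
		return 0
	}
	// Integer division rounds down for non-negative values, so we never
	// cordon more capacity than the surplus.
	n := (surplusGames * cpuRequestMilli) / nodeCapacityMilli
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Assumed numbers: each game requests 100m CPU, each node has 1000m.
	fmt.Println(nodesToCordon(25, 100, 1000)) // 2
	fmt.Println(nodesToCordon(9, 100, 1000))  // 0
}
```

Rounding down is the important design choice here: with a surplus of 25 game slots we cordon two nodes (20 slots) and keep the remaining five as extra buffer, rather than cordoning three and dipping below the configured buffer.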

Removing Nodes from the Cluster

Now that we have nodes in our cluster being cordoned, it is just a matter of waiting until a cordoned node is empty of game server Pods before deleting it. The code below also makes sure the node count never drops below a configured minimum, as a baseline for capacity within our cluster.

You can see this in the code below, and in the original context:

scaler.go

    // deleteCordonedNodes will delete a cordoned node if
    // the time since it was cordoned has expired
    func (s Server) deleteCordonedNodes() error {
        nl, err := s.newNodeList()
        if err != nil {
            return err
        }

        l := int64(len(nl.nodes.Items))
        if l <= s.minNodeNumber {
            log.Print("[Info][deleteCordonedNodes] Already at minimum node count. exiting")
            return nil
        }

        var dn []v1.Node
        for _, n := range nl.cordonedNodes() {
            ct, err := cordonTimestamp(n)
            if err != nil {
                return err
            }

            pl := nl.nodePods(n)
            // if no game session pods && if they have passed expiry, then delete them
            if len(filterGameSessionPods(pl.Items)) == 0 && ct.Add(s.shutdown).Before(s.clock.Now()) {
                err := s.cs.CoreV1().Nodes().Delete(n.Name, nil)
                if err != nil {
                    return errors.Wrapf(err, "Error deleting cordoned node: %v", n.Name)
                }
                dn = append(dn, n)

                // don't delete more nodes than the minimum number set
                if l--; l <= s.minNodeNumber {
                    break
                }
            }
        }

        return s.nodePool.DeleteNodes(dn)
    }
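Stripped of the Kubernetes API calls, the selection logic in deleteCordonedNodes comes down to a loop with two guards. The following is a simplified sketch under that reading; the `cordoned` type and `deletable` function are hypothetical names, and the expiry check is reduced to a precomputed boolean.

```go
package main

import "fmt"

// cordoned is a hypothetical summary of a cordoned node.
type cordoned struct {
	name     string
	gamePods int  // game session Pods still running on the node
	expired  bool // true once the cordon delay has passed
}

// deletable returns the names of the cordoned nodes that can be deleted:
// those with no game Pods whose delay has expired, stopping as soon as the
// cluster would otherwise drop below minNodes.
func deletable(total, minNodes int, nodes []cordoned) []string {
	var out []string
	for _, n := range nodes {
		if total <= minNodes {
			break // never go below the configured minimum
		}
		if n.gamePods == 0 && n.expired {
			out = append(out, n.name)
			total--
		}
	}
	return out
}

func main() {
	nodes := []cordoned{
		{name: "node-a", gamePods: 0, expired: true},
		{name: "node-b", gamePods: 2, expired: true}, // games still in progress
		{name: "node-c", gamePods: 0, expired: true},
	}
	fmt.Println(deletable(4, 3, nodes)) // [node-a]
	fmt.Println(deletable(5, 2, nodes)) // [node-a node-c]
}
```

In the first call only one node can go before the cluster hits its minimum of three; in the second, node-b is skipped because it still hosts active games.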

Conclusion

We’ve successfully containerised our game servers, scaled them up as demand increases, and now scaled our Kubernetes cluster down, so we don’t have to pay for underutilised machinery — all powered by the APIs and capabilities that Kubernetes makes available out of the box. While it would take more work to turn this into a production level system, you can already see how to take advantage of the many building blocks available to you.

Before we finish, I would like to apologise for the delay in producing the fourth part in this series. If you saw the announcement, you may have guessed that a lot of my time got taken up developing and releasing Agones, the open source, productised version of this series of posts on running game servers on Kubernetes.

On that note, this will also be the last installment in this series. I had already completed the work to implement scaling down before starting on Agones, and rather than build out new functionality for global cluster management on Paddle Soccer, I’m going to focus those efforts on building out awesome new features for Agones and bringing it up from its current 0.1 alpha release to a full 1.0, production-ready milestone.

I’m very excited about the future of Agones, and if my series of blog posts has inspired you, watch the GitHub repository, join the Slack, follow us on Twitter and get involved on the mailing list. We’re actively seeking more contributors, and would love to have you involved.

Lastly, I welcome questions and comments here, or you can reach out to me via Twitter. You can also see my presentations at GDC and GCAP from 2017 on this topic, as well as check out the code on GitHub.

All posts in this series: