By Andrew Chen and Dominik Tornow

Kubernetes is a Container Orchestration Engine designed to host containerized applications on a set of nodes, commonly referred to as a cluster. Using a systems modeling approach, this series aims to advance the understanding of Kubernetes and its underlying concepts.

The Kubernetes Scheduler is a core component of Kubernetes: After a user or a controller creates a Pod, the Kubernetes Scheduler, monitoring the Object Store for unassigned Pods, will assign the Pod to a Node. Then, the Kubelet, monitoring the Object Store for assigned Pods, will execute the Pod.

This blog post provides a concise, detailed model of the Kubernetes Scheduler. The model is supported by partial specifications in TLA+.

Figure 1. Processing a Pod

Scheduling

The task of the Kubernetes Scheduler is to choose a placement. A placement is a partial, non-injective assignment of a set of Pods to a set of Nodes.
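This notion can be sketched in a few lines of Python; the names are illustrative, not Kubernetes API identifiers:

```python
# A toy model of a placement: the assignment is partial -- not every Pod
# is placed -- and non-injective -- several Pods may share one Node.
pods = {"pod-a", "pod-b", "pod-c"}
nodes = {"node-1", "node-2"}

placement = {
    "pod-a": "node-1",
    "pod-b": "node-1",  # non-injective: node-1 hosts two Pods
    # "pod-c" is deliberately unassigned: the assignment is partial
}

def is_placement(assignment, pods, nodes):
    """Check that an assignment maps a subset of the Pods into the set of Nodes."""
    return all(p in pods and n in nodes for p, n in assignment.items())
```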

Figure 2. Example Schedule

Scheduling is an optimization problem: First, the Scheduler determines the set of feasible placements, which is the set of placements that meet a set of given constraints. Then, the Scheduler determines the set of viable placements, which is the set of feasible placements with the highest score.
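The two phases can be sketched as a small Python function, with `constraints` and `score` as hypothetical stand-ins for the Scheduler's actual filter and rating functions:

```python
def viable_placements(placements, constraints, score):
    """Two-phase selection: feasible = meets all constraints;
    viable = feasible with the highest score."""
    feasible = [p for p in placements if all(c(p) for c in constraints)]
    if not feasible:
        return []
    best = max(score(p) for p in feasible)
    return [p for p in feasible if score(p) == best]

# Toy example: placements are (pod, node) pairs; the constraint rejects
# node-0, and the score prefers higher node numbers.
candidates = [("pod-a", f"node-{i}") for i in range(3)]
result = viable_placements(
    candidates,
    constraints=[lambda p: p[1] != "node-0"],
    score=lambda p: int(p[1].split("-")[1]),
)
```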

Figure 3. Possible, Feasible, and Viable Schedules

The Kubernetes Scheduler is a multi-step scheduler ensuring a local optimum instead of being a single-step scheduler ensuring a global optimum.

Figure 4. Multi-Step vs. Single Step

The Kubernetes Scheduler

Figure 5. Kubernetes Pod Object and Kubernetes Node Object

Figure 5 depicts the Kubernetes Objects and attributes that are of interest to the Kubernetes Scheduler. Kubernetes represents

a Pod as a Kubernetes Pod Object,

a Node as a Kubernetes Node Object, and

the assignment of a Pod to a Node as the Pod’s .Spec.NodeName.

A Pod Object is bound to a Node Object if the Pod’s .Spec.NodeName equals the Node’s .Name.

The task of the Kubernetes Scheduler can now more formally be described as: The Kubernetes Scheduler, for a Pod p, selects a Node n and updates(*) the Pod’s .Spec.NodeName so that BoundTo(p, n) is true.

The Control Loop

The Kubernetes Scheduler monitors the Kubernetes Object Store and chooses an unbound Pod of the highest priority to perform either a Scheduling Step or a Preemption Step.
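A single iteration of this loop can be sketched as follows; `feasible` and `score` are hypothetical stand-ins for the filter and rating functions detailed below:

```python
def control_loop_step(unbound_pods, nodes, feasible, score):
    """One iteration of the loop: pick the unbound Pod with the highest
    priority, then schedule it if any Node is feasible, else fall back
    to preemption."""
    if not unbound_pods:
        return None
    pod = max(unbound_pods, key=lambda p: p["priority"])
    candidates = [n for n in nodes if feasible(pod, n)]
    if candidates:
        best = max(candidates, key=lambda n: score(pod, n))
        return ("schedule", pod["name"], best)
    return ("preempt", pod["name"], None)

# Toy run: only node-b is feasible, and the high-priority Pod goes first.
pods = [{"name": "low", "priority": 1}, {"name": "high", "priority": 10}]
step = control_loop_step(
    pods, ["node-a", "node-b"],
    feasible=lambda p, n: n == "node-b",
    score=lambda p, n: 0,
)
```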

Scheduling Step

For a given Pod, the Scheduling Step is enabled if there exists at least one Node, such that the Node is feasible to host the Pod.

If the Scheduling Step is enabled, the Scheduler will bind the Pod to a feasible Node, such that the binding will achieve the highest possible viability.

If the Scheduling Step is not enabled, the Scheduler will attempt to perform a Preemption Step.

Preemption Step

For a given Pod, the Preemption Step is enabled if there exists at least one Node, such that the Node is feasible to host the Pod if a subset of Pods with lower priorities bound to this Node were to be deleted.

If the Preemption Step is enabled, the Scheduler will trigger the deletion of a subset of Pods with lower priorities bound to one Node, such that the Preemption Step will inflict the lowest possible casualties.

(The inflicted casualty is assessed in terms of Pod Disruption Budget (PDB) violations, but is beyond the scope of this post.)

Note that the Scheduler does not guarantee that the Pod which triggered the Preemption Step will be bound to that Node in a subsequent Scheduling Step.
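The victim selection can be sketched as follows, using a simplified one-dimensional capacity model and victim count as a stand-in for the real PDB-based casualty assessment:

```python
def victims_on_node(pod, node_pods, free, request):
    """Lower-priority Pods on one Node whose deletion would free enough
    capacity for `pod`; evict the lowest priorities first. Returns None
    if even deleting all lower-priority Pods is not enough."""
    victims = []
    lower = sorted((p for p in node_pods if p["priority"] < pod["priority"]),
                   key=lambda p: p["priority"])
    for v in lower:
        if free >= request:
            break
        victims.append(v)
        free += v["request"]
    return victims if free >= request else None

def least_casualties(pod, cluster):
    """Across all Nodes, pick the preemption option with the fewest victims."""
    options = []
    for node, (node_pods, free) in cluster.items():
        v = victims_on_node(pod, node_pods, free, pod["request"])
        if v is not None:
            options.append((node, v))
    return min(options, key=lambda o: len(o[1]), default=None)

# Toy cluster: evicting one Pod from node-2 beats evicting two from node-1.
pending = {"priority": 5, "request": 2}
cluster = {
    "node-1": ([{"priority": 1, "request": 1},
                {"priority": 2, "request": 1}], 0),
    "node-2": ([{"priority": 3, "request": 2}], 0),
}
plan = least_casualties(pending, cluster)
```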

1. Feasibility

For each Pod, the Kubernetes Scheduler identifies the set of feasible Nodes, which is the set of Nodes that satisfy the constraints of the Pod.

Conceptually, the Kubernetes Scheduler defines a set of filter functions that, given a Pod and a Node, determine if the Node satisfies the constraints of the Pod. All filter functions must yield true for the Node to host the Pod.
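As a sketch, this conjunction of filter functions amounts to the following; the two toy filters merely stand in for the real ones:

```python
def node_is_feasible(pod, node, filters):
    """A Node is feasible for a Pod only if every filter yields True."""
    return all(f(pod, node) for f in filters)

# Two toy filters standing in for the real filter functions.
filters = [
    lambda pod, node: not node["unschedulable"],
    lambda pod, node: pod["request"] <= node["free"],
]
node = {"unschedulable": False, "free": 4}
```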

The following subsections detail some of the available filter functions:

1.1 Schedulability and Lifecycle Phase

This filter function deems a Node feasible based on the Node’s schedulability and lifecycle phase. Node conditions are accounted for via taints and tolerations (see below).

Figure 1.1 Schedulability and Lifecycle Phase
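A minimal sketch of this filter, with simplified field names standing in for the Node's .Spec.Unschedulable and .Status.Phase:

```python
def schedulable_and_running(pod, node):
    """Feasible only if the Node is not marked unschedulable and is in
    the Running lifecycle phase."""
    return not node["unschedulable"] and node["phase"] == "Running"
```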

1.2 Resource Requirements and Resource Availability

This filter function deems a Node feasible based on the Pod’s resource requirements and the Node’s resource availabilities.

Figure 1.2 Resource Requirements and Resource Availability
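A sketch of this filter, assuming a single map of resource requests per Pod and of allocatable amounts per Node:

```python
def fits_resources(pod, node, bound_pods):
    """The Pod's requests must fit into what remains of the Node's
    allocatable resources after the requests of already-bound Pods."""
    for resource, request in pod["requests"].items():
        used = sum(p["requests"].get(resource, 0) for p in bound_pods)
        if used + request > node["allocatable"].get(resource, 0):
            return False
    return True
```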

1.3 Node Selector

This filter function deems a Node feasible based on the Pod’s node selector values and the Node’s label values.

Figure 1.3 Node Selector
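A sketch of this filter: the Pod's node selector must be a subset of the Node's labels.

```python
def matches_node_selector(pod, node):
    """Every key/value pair in the Pod's node selector must be present
    among the Node's labels."""
    return all(node["labels"].get(key) == value
               for key, value in pod["node_selector"].items())
```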

1.4 Node Taints and Pod Tolerations

This filter function deems a Node feasible based on the Node’s taints’ key value pairs and the Pod’s tolerations’ key value pairs.

Figure 1.4 Node Taints and Pod Tolerations

A Pod may be bound to a Node such that the Node’s Taints match the Pod’s Tolerations. A Pod must not be bound to a Node if the Node’s Taints do not match the Pod’s Tolerations.
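A sketch of this filter; real taints and tolerations also carry operators and effects, which are omitted here, so a taint and a toleration are reduced to plain (key, value) pairs:

```python
def tolerates(pod, node):
    """Feasible only if every taint on the Node is matched by a
    toleration on the Pod."""
    return all(taint in pod["tolerations"] for taint in node["taints"])
```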

1.5 Required Affinity

This filter function deems a Node feasible based on the Pod’s required Node Affinity Terms, Pod Affinity Terms, and Pod Anti Affinity Terms.

Figure 1.5 Required Affinity

Node Affinity

A Pod must be assigned to a Node such that the Node’s labels match the Pod’s Node Affinity Requirements. In addition, a Pod must not be assigned to a Node such that the Node’s labels do not match the Pod’s Node Affinity Requirements.

Pod Affinity

A Pod must be assigned to a Node such that at least one Pod on a Node matching the TopologyKey matches the Pod’s Pod Affinity Requirements.

Pod Anti-Affinity

A Pod must be assigned to a Node such that no Pod on a Node matching the TopologyKey matches the Pod’s Pod Anti-Affinity Requirements.
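The three requirements can be sketched together. The matching below uses plain label equality where the real API uses label selector terms, and a topology domain is reduced to the set of Nodes sharing the topology key's label value:

```python
def required_affinity(pod, node, all_nodes, pods_on):
    """Simplified required-affinity check for placing `pod` on `node`."""
    # Node affinity: the Node's labels must satisfy the required terms.
    if not all(node["labels"].get(k) == v
               for k, v in pod["node_affinity"].items()):
        return False
    # The topology domain: Nodes sharing the topology key's value.
    key = pod["topology_key"]
    domain = [n for n in all_nodes
              if n["labels"].get(key) == node["labels"].get(key)]
    neighbours = [q for n in domain for q in pods_on[n["name"]]]
    # Pod affinity: at least one Pod in the domain must match.
    if pod["pod_affinity"] and not any(
            q["labels"] == pod["pod_affinity"] for q in neighbours):
        return False
    # Pod anti-affinity: no Pod in the domain may match.
    if pod["pod_anti_affinity"] and any(
            q["labels"] == pod["pod_anti_affinity"] for q in neighbours):
        return False
    return True

# Toy topology: two Nodes in zone "a", a "db" Pod already on n1.
nodes = [{"name": "n1", "labels": {"zone": "a"}},
         {"name": "n2", "labels": {"zone": "a"}}]
pods_on = {"n1": [{"labels": {"app": "db"}}], "n2": []}
web = {"node_affinity": {}, "topology_key": "zone",
       "pod_affinity": {"app": "db"}, "pod_anti_affinity": {}}
```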

2. Viability

For each Pod, the Kubernetes Scheduler first identifies the set of feasible Nodes, as described above. Then, among the feasible Nodes, the Kubernetes Scheduler identifies the set of Nodes with the highest viability.

Conceptually, the Kubernetes Scheduler defines a set of rating functions that, given a Pod and a Node, determine the viability of the Pod and Node pair. The ratings of all rating functions are summed.

The following subsection details one of the available rating functions:

2.1 Preferred Affinity

This rating function rates a Node’s viability based on the Pod’s preferred Node Affinity Terms, Pod Affinity Terms, and Pod Anti Affinity Terms.

Figure 2.1 Preferred Affinity

The rating is the sum of:

the Term Weights of each matching Node Selector Term,

the Term Weights of each matching Pod Affinity Term, and

the Term Weights of each matching Pod Anti Affinity Term.
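A sketch of this rating function, with each preferred term carrying a weight and a hypothetical matches predicate:

```python
def preferred_affinity_score(pod, node):
    """The Node's rating: the sum of the weights of every preferred
    term the Node satisfies."""
    return sum(term["weight"]
               for term in pod["preferred_terms"]
               if term["matches"](node))

# Toy Pod: prefers zone "a" (weight 10) and GPU Nodes (weight 5).
pod = {"preferred_terms": [
    {"weight": 10, "matches": lambda n: n["labels"].get("zone") == "a"},
    {"weight": 5,  "matches": lambda n: "gpu" in n["labels"]},
]}
score = preferred_affinity_score(pod, {"labels": {"zone": "a"}})
```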

Case Study

Figure 6 depicts a case study involving two different types of Nodes and two different types of Pods:

9 Nodes without GPU resources

6 Nodes with GPU resources

The objective of the case study is to ensure that:

Pods that do not require GPUs are assigned to Nodes without GPUs

Pods that do require GPUs are assigned to Nodes with GPUs

Figure 6. Case Study

(*) Kubernetes Binding Objects

The blog post states that the Kubernetes Scheduler binds a Pod to a Node by setting the Pod’s .Spec.NodeName to the Node’s Name. However, the Scheduler sets the .Spec.NodeName not directly but indirectly.

The Kubernetes Scheduler is not permitted to update a Pod’s .Spec. Therefore, instead of updating the Pod directly, the Kubernetes Scheduler creates a Kubernetes Binding Object. On creation of a Binding Object, the Kubernetes API will update the Pod’s .Spec.NodeName.
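As an illustration, such a Binding Object can be modeled as the following structure; the Pod, Node, and namespace names are illustrative:

```python
# The shape of a core/v1 Binding Object: it names the Pod it belongs to
# and targets the Node the Pod should be bound to.
binding = {
    "apiVersion": "v1",
    "kind": "Binding",
    "metadata": {"name": "my-pod", "namespace": "default"},
    "target": {"apiVersion": "v1", "kind": "Node", "name": "node-1"},
}
```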

About this post

This blog post summarizes a session on the Kubernetes Scheduler from the KubeCon 2018 Contributor Summit’s “Unconference” track, hosted by Google and SAP.