Influencing Kubernetes Scheduler Decisions
Given so many methods for influencing Pod placement, learn about the flexibility to decide which nodes should be running your Pods.
To get the best possible performance and availability from the infrastructure at hand, the scheduler uses complex algorithms to place Pods as efficiently as it can. In this article, we discuss how the scheduler selects the best node to host a Pod and how we can influence its decision.
Which Node Has Available Resources?
When choosing the appropriate node, the scheduler examines each node for whether or not it can host the Pod. If you are following the Capacity Planning pattern, you’re already declaring the amount of CPU and memory your Pods require (through requests and limits). The scheduler uses the following equation to calculate the available memory on a given node:
Usable memory = available memory - reserved memory
The reserved memory refers to:
- Memory used by Kubernetes daemons like the kubelet and containerd (or another container runtime).
- Memory used by the node’s operating system, for example kernel daemons.
By using this equation, the scheduler ensures that no resource starvation occurs on the node as a result of too many Pods competing to consume all the node’s available resources.
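As a rough illustration of where the reserved amounts can come from: on many clusters they are declared in the kubelet configuration. The sketch below assumes the kubelet is configured through a KubeletConfiguration file, and the quantities are made-up values rather than recommendations.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserved for Kubernetes daemons such as the kubelet and the container runtime
# (illustrative values only).
kubeReserved:
  cpu: 200m
  memory: 512Mi
# Reserved for the operating system's own daemons (illustrative values only).
systemReserved:
  cpu: 200m
  memory: 512Mi
```

With settings like these, the kubelet reports the node’s allocatable resources as its capacity minus the reserved amounts (and minus any eviction threshold), and Pods are scheduled against that allocatable figure.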
Influencing the Scheduling Process
When left without influence from the user, the scheduler does the following steps when scheduling a Pod to a node:
- The scheduler detects that a new Pod has been created and it is not yet assigned to a node.
- It examines the Pod requirements and filters out all nodes that cannot satisfy them.
- The remaining nodes are ordered by weight, with the highest-weighted nodes at the top.
- The scheduler chooses the first node in the sorted list and assigns the Pod to it.
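The "Pod requirements" examined in the second step are, most commonly, the resource requests declared on the containers. A minimal sketch of such a declaration follows; the Pod name and the quantities are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api              # hypothetical name
spec:
  containers:
  - name: api
    image: nginx
    resources:
      requests:          # what the scheduler compares against each node's free capacity
        cpu: 250m
        memory: 256Mi
      limits:            # upper bound enforced on the node at runtime
        cpu: 500m
        memory: 512Mi
```

A node whose remaining allocatable CPU or memory is smaller than the requests is filtered out before any weighting takes place.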
Usually, it is wise to let the scheduler pick the appropriate node as it sees fit (provided that the Pod requirements have been declared beforehand). However, sometimes you may need to influence this decision by forcing the scheduler to select a specific node, or by manually adding weight to several nodes to make them better candidates for Pod placement than others. Let’s have a look at how we can do this.
In the simplest (and most aggressive) form of node selection, you force a Pod to run on one, and only one, node by specifying its name in .spec.nodeName. For example, the following Pod definition forces the Pod to get scheduled on app-prod01:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: app-prod01
Notice that this approach is the easiest yet the least recommended method of node selection for the following reasons:
- If for any reason the node with the specified name could not be located (like if its hostname was changed) the Pod will not run.
- If the node did not have the necessary resources to run the Pod, the Pod is not scheduled to other nodes; it will fail.
- This causes the Pods to be tightly coupled with their nodes, which is a bad design practice.
A more flexible, and still easy, method of influencing the scheduler's decision is the .spec.nodeSelector parameter in the Pod definition (or in the Pod template if you are using a controller like a Deployment). The nodeSelector accepts one or more key-value pairs of labels that must be present on the node for the Pod to get scheduled on it. Let's say that you have recently purchased two machines that are equipped with SSD disks. You want any Pod that hosts a database container to get scheduled on the SSD-backed nodes to receive the best DB performance. A Pod definition for your DB Pods may look as follows:
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: mongodb
    image: mongo
  nodeSelector:
    disktype: ssd
Given that definition, only nodes that have the label disktype=ssd will be considered when the scheduler selects the suitable nodes for Pod assignment.
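When the Pods are managed by a controller, the same nodeSelector goes into the Pod template instead. The following is a minimal sketch for a Deployment; the app=db label and the replica count are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
spec:
  replicas: 2              # illustrative value
  selector:
    matchLabels:
      app: db              # hypothetical label tying the Deployment to its Pods
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: mongodb
        image: mongo
      # Every Pod created from this template is restricted to SSD-labeled nodes.
      nodeSelector:
        disktype: ssd
```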
Additionally, you can use any of the built-in labels that are automatically assigned to the nodes to manipulate the selection decision. For example, the node’s hostname (kubernetes.io/hostname), architecture (kubernetes.io/arch), the OS (kubernetes.io/os) among others can be used in node selection.
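For instance, a Pod could be restricted to Linux nodes with a particular CPU architecture using only built-in labels. This is a sketch; the Pod name and the chosen values (linux, amd64) are assumptions about the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-amd64        # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    # Both labels must match; nodeSelector entries are combined with AND.
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64
```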
Node selection is beneficial when you need to select specific nodes for running your Pods, but the way you choose the nodes is limited: only the nodes that match all the defined labels are considered for Pod placement. Node Affinity gives you more flexibility by allowing you to define hard and soft node requirements. The hard requirements must be matched on the node for it to be selected. The soft requirements, on the other hand, allow you to add more weight to nodes with specific labels so that they rank higher in the list than their peers. A node that does not have the soft-requirement labels will not be ignored; it will only have less weight.
Consider an example: our database is I/O intensive. We need the database Pods to always run on SSD-backed nodes. Additionally, we’d have lower latency if the Pods were deployed on nodes located in zones zone1 or zone2, as they are physically closer to the application nodes. A Pod definition that addresses our needs may look like this:
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - zone1
            - zone2
  containers:
  - name: db
    image: mongo
The nodeAffinity stanza uses the following parameters to define the hard and soft requirements:
- requiredDuringSchedulingIgnoredDuringExecution: the node must have disk-type=ssd to be considered for deploying the DB Pod.
- preferredDuringSchedulingIgnoredDuringExecution: when sorting the nodes, the scheduler gives higher weight to nodes having the labels zone=zone1 or zone=zone2. If a node has disk-type=ssd and zone=zone1, it is preferred to another that has disk-type=ssd and either no zone label or one that points to a different zone. The weight can be any value from 1 to 100; the higher the number, the more strongly a matching node is favored over nodes that do not match.
Notice that Node Affinity gives you more freedom in choosing which labels should exist (or not exist) on the target node. In the example, we used the In operator to list more than one label value, any of which may be present on the target node. The other operators are NotIn, Exists, DoesNotExist, Lt (less than), and Gt (greater than). It is worth noting that NotIn and DoesNotExist achieve what’s called Node Anti-Affinity.
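A node anti-affinity rule might look like the sketch below. The hdd value and the maintenance label key are hypothetical and only serve to show the two operators.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Expressions inside one term are combined with AND.
        - matchExpressions:
          # Reject nodes whose disk-type label is hdd (hypothetical value).
          - key: disk-type
            operator: NotIn
            values:
            - hdd
          # Reject nodes that carry a maintenance label at all (hypothetical key).
          # DoesNotExist takes no values.
          - key: maintenance
            operator: DoesNotExist
  containers:
  - name: db
    image: mongo
```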
Node Affinity and the Node Selectors are not mutually exclusive; they can coexist in the same definition file. However, in such a case, both the Node Selector and the Node Affinity hard requirements must match.
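A sketch of both mechanisms in one definition, reusing the disktype and zone labels from the earlier examples, could look like this. A node has to satisfy the nodeSelector and the required affinity term to be eligible.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  # Condition 1: the node must be labeled disktype=ssd.
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      # Condition 2: the node must also be in zone1 or zone2.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zone1
            - zone2
  containers:
  - name: db
    image: mongo
```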
The Node Selector and Node Affinity (and anti-affinity) help us influence the scheduler’s decision as to where to place the Pods. However, they allow you to make the selection based on the labels of the nodes only. They don’t take into account the labels that the Pods themselves have. You may need to make selections based on the Pod labels in scenarios like:
- I need all middleware Pods to be placed together on the same physical node as those labeled role=frontend to decrease network latency between them.
- As a security best practice, we don’t want the middleware Pods to coexist with those which handle user authentication (role=auth). This is not a strict requirement.
As you can see, such requirements cannot be fulfilled with node selectors or node affinity, because only the node labels are considered in the selection process, not the Pod labels.
To address those needs, we use Pod affinity and anti-affinity. In essence, they work the same way as node affinity and anti-affinity: we have hard requirements that must be fulfilled for the target node to be selected, and soft conditions that increase the chance (weight) of a node being chosen but do not make it strictly required. Let’s have an example:
apiVersion: v1
kind: Pod
metadata:
  name: middleware
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: role
              operator: In
              values:
              - auth
          topologyKey: kubernetes.io/hostname
  containers:
  - name: middleware
    image: redis
In the above Pod definition file, we set the hard and soft requirements as follows:
- requiredDuringSchedulingIgnoredDuringExecution: our Pod must be scheduled on nodes that are running Pods labeled role=frontend.
- preferredDuringSchedulingIgnoredDuringExecution: our Pod should not (but it could) be scheduled on nodes that have running Pods labeled role=auth. As with node affinity, a soft requirement carries a weight from 1 to 100. In our example, the soft requirement is placed under podAntiAffinity, so the weight counts against nodes running Pods labeled role=auth and makes them less likely to be selected when the scheduler makes its decision.
The topologyKey is used to make more granular decisions as to which domain the rules would be applied. The topologyKey accepts a label key which must be present on the node to be considered during the selection process. In our example, we used an auto-populated label that is automatically added by default to all nodes and refers to the node’s hostname. But you may use other auto-populated labels or even your custom ones. For example, you may need to apply the Pod affinity rules only on nodes that have the rack or a zone label.
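As a sketch of a coarser topology domain, the rule below only asks that the middleware Pod land in the same zone as a frontend Pod, not necessarily on the same node. It assumes the nodes carry the well-known topology.kubernetes.io/zone label, which is typically set by cloud providers but is not guaranteed to exist on every cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: middleware
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        # Nodes sharing the same value for this label form one domain,
        # so any node in a zone that hosts a frontend Pod qualifies.
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: middleware
    image: redis
```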
A Note About IgnoredDuringExecution
You may have noticed that both the hard and soft requirements have the IgnoredDuringExecution suffix. It means that after the scheduling decision has been made, the scheduler will not attempt to move already-placed Pods even if the conditions change. For example, suppose that, according to Node Affinity rules, a Pod was scheduled to a node having the label app=prod. If that label is later changed to app=dev, the Pod will not be terminated and restarted on another node that has app=prod. This behavior may change in the future to allow the scheduler to continually examine the node and Pod affinity (and anti-affinity) rules after deployment.
Taints and Tolerations
In some scenarios, you may want to prevent Pods from getting scheduled to a specific node. Perhaps you are running diagnostic tests or scanning this node for threats, and you don’t want the application to be affected. Node anti-affinity can be used to achieve this goal. However, it is a significant administrative burden because you will need to add the anti-affinity rules to each new Pod that gets deployed to the cluster. For such a scenario, you should use taints.
When a node is tainted, no Pod can be scheduled to it unless the Pod tolerates the taint. The toleration is nothing but a key-value pair that matches that of the taint. Let’s have an example to demonstrate:
The host web01 needs to be tainted so that it doesn’t accept more Pods. The taint command can be issued as follows:
kubectl taint nodes web01 locked=true:NoSchedule
The above command places a taint on the node named web01 that has the following properties:
- A key-value pair locked=true. A Pod that should be allowed onto the node must declare a toleration that matches it.
- A taint effect of NoSchedule. The effect defines how the taint is applied, and it can be one of the following:
  - NoSchedule: the system must not schedule any Pods to this node unless they have the matching toleration (hard requirement).
  - PreferNoSchedule: the system should not (but could) place Pods on this node if they don’t have the matching toleration (soft limit).
  - NoExecute: in addition to not scheduling new Pods that lack the matching toleration, the system evicts the Pods that are already running on the node and do not have the matching toleration.
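Once applied, the taint is stored on the Node object itself. The excerpt below is a sketch of what the relevant part of web01’s definition would contain; all other fields of the Node are omitted.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: web01
spec:
  taints:
  # One entry per taint: key, optional value, and effect.
  - key: locked
    value: "true"
    effect: NoSchedule
```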
The definition file for a Pod that has the necessary toleration to get scheduled on the tainted node may look as follows:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
Let’s have a closer look at the tolerations part of this definition:
- To have the correct toleration, we need to specify the key (locked), the value (true) and the operator.
- The operator can be one of two values:
  - Equal: the key, value, and taint effect must match the node’s taint.
  - Exists: you don’t need to match the taint value (having key: "locked" is sufficient), so the value is left out of the toleration.
- With the Exists operator, you can also leave out the key and the effect. A toleration that specifies nothing but the Exists operator matches every taint, so a Pod with such a toleration can get scheduled to any tainted node.
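A sketch of the same Pod using the Exists operator follows. Only the tolerations section differs from the earlier definition.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  # Tolerates any NoSchedule taint whose key is "locked", whatever its value.
  # No value field is given when the operator is Exists.
  - key: "locked"
    operator: "Exists"
    effect: "NoSchedule"
  # A toleration consisting only of `operator: "Exists"` (no key, no effect)
  # would tolerate every taint on every node.
```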
Notice that placing toleration on a Pod does not guarantee that it gets deployed to tainted nodes. It only allows the action to happen. If you want to force the Pod to join a tainted node, you must also add node affinity to its definition as discussed earlier.
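Combining the two might look like the sketch below: the toleration permits the Pod on web01, and the node affinity rule requires it to go there. It assumes that the node’s kubernetes.io/hostname label has the value web01, which is usual but worth verifying on the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  # Allows the Pod onto the tainted node.
  tolerations:
  - key: "locked"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  # Requires the Pod to be placed on that node.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - web01        # assumed to match the node's hostname label
```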
Automatic placement of containers on nodes is one of the very reasons why Kubernetes came into existence. As an administrator, you should not need to concern yourself with questions like which node has enough free resources to host your Pods, as long as you declare the Pod requirements well. However, sometimes you have to step in manually and override the system’s decision as to where to place Pods. In this article, we discussed several methods by which you can influence the scheduler to favor specific nodes over others when placing Pods. Let’s have a quick review of those methods:
- Node name: by adding a node’s hostname to the .spec.nodeName parameter of the Pod definition, you force this Pod to run on that specific node. Any selection algorithm used by the scheduler is ignored. This method is the least recommended.
- Node selector: by placing meaningful labels on your nodes, a Pod can use the nodeSelector parameter to specify one or more key-value label pairs that must exist on the target node for it to be selected for running that Pod. This approach is preferable to using the node name because it adds flexibility and establishes a loosely-coupled node-pod relationship.
- Node affinity: this method adds even more flexibility when choosing which node should be considered for scheduling a particular Pod. Using Node Affinity, a Pod may strictly require to be scheduled on nodes with specific labels. It may also express some degree of preference towards particular nodes by influencing the scheduler to give them more weight.
- Pod affinity and anti-affinity: when Pod coexistence (or non-coexistence) with other Pods on the same node is essential, you can use this method. Pod affinity allows a Pod to require that it gets deployed on nodes that have Pods with specific labels running. Similarly, a Pod may force the scheduler not to place it on nodes having Pods with particular labels.
- Taints and tolerations: in this method, instead of deciding which nodes the Pod gets scheduled to, you decide which nodes should not accept any Pods at all or only selected Pods. By tainting a node, you’re instructing the scheduler not to consider this node for any Pod placement except if the Pod tolerates the taint. The toleration consists of a key, value, and the effect of the taint. Using an operator, you can decide whether the entire taint must match the toleration for a successful Pod placement or only a subset of the data must match.
Given so many methods for influencing Pod placement, you have a lot of flexibility to decide which nodes should be running your Pods. However, you should not interfere with the scheduler’s Pod placement algorithms unless strictly required.