Considerations on Node Pool Taints and Unmanaged Pods
Depending on the environment or cloud provider being used, a CNI plugin and/or configuration file may be pre-installed on nodes belonging to a cluster where Cilium is being installed or is already running. When it starts on a node, and if it is intended to be the exclusive CNI plugin for the cluster, Cilium does its best to take ownership of CNI on that node. However, a couple of situations can prevent this from happening:
Cilium can only take ownership of CNI on a node after starting. Pods starting before Cilium runs on a given node may get IPs from the pre-configured CNI.
Some cloud providers may revert changes made to the CNI configuration by Cilium during operations such as node reboots, updates or routine maintenance.
This is notably the case with GKE (non-Dataplane V2), where node reboots and upgrades undo the changes made by Cilium and reinstate the default CNI configuration.
To help overcome this to the largest possible extent in environments and cloud providers where Cilium isn’t supported as the single CNI, Cilium can manipulate Kubernetes taints on a node to help prevent pods from starting before Cilium runs on that node. The mechanism works as follows:
The cluster administrator places a specific taint (see below) on a given uninitialized node. Depending on the taint’s effect (see below), this prevents pods that don’t have a matching toleration from either being scheduled or altogether running on the node until the taint is removed.
Cilium runs on the node, initializes it and, once ready, removes the aforementioned taint.
From this point on, pods will start being scheduled and running on the node, having their networking managed by Cilium.
If Cilium is temporarily removed from the node, the Operator will re-apply the taint (but only with NoSchedule).
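The four steps above can be pictured with a small toy model. This is only a sketch and is not Cilium or Kubernetes code: the Taint, Pod and Node classes and the can_schedule helper are hypothetical names introduced for illustration, toleration matching is reduced to a key-and-effect comparison, and the agent pod is assumed to tolerate the taint.

```python
from dataclasses import dataclass, field

AGENT_NOT_READY = "node.cilium.io/agent-not-ready"  # default taint key


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule" or "NoExecute"


@dataclass
class Pod:
    name: str
    tolerations: set = field(default_factory=set)  # set of (key, effect) pairs


@dataclass
class Node:
    taints: set = field(default_factory=set)


def can_schedule(pod: Pod, node: Node) -> bool:
    """Toy rule: a pod is schedulable only if it tolerates every taint on the node."""
    return all((t.key, t.effect) in pod.tolerations for t in node.taints)


# Step 1: the administrator taints the uninitialized node.
taint = Taint(AGENT_NOT_READY, "NoExecute")
node = Node(taints={taint})

app = Pod("app")  # ordinary workload without a toleration
agent = Pod("cilium-agent", tolerations={(taint.key, taint.effect)})  # assumed toleration

before = (can_schedule(app, node), can_schedule(agent, node))

# Step 2: the agent initializes the node and removes the taint.
node.taints.discard(taint)

# Step 3: ordinary pods can now be scheduled.
after = can_schedule(app, node)

# Step 4: if the agent is removed, the Operator re-applies the taint as NoSchedule.
node.taints.add(Taint(AGENT_NOT_READY, "NoSchedule"))
reapplied = can_schedule(app, node)
```

In this model the ordinary pod is blocked while the taint is present, becomes schedulable once the agent removes it, and is blocked again after the Operator re-applies the taint.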
By default, the taint key is node.cilium.io/agent-not-ready, but in some
scenarios (such as when Cluster Autoscaler is being used but its flags cannot be
configured) this key may need to be tweaked. This can be done using the
agent-not-ready-taint-key option. In the aforementioned example, users should
specify a key starting with ignore-taint.cluster-autoscaler.kubernetes.io/.
When such a value is used, the Cluster Autoscaler will ignore it when simulating
scheduling, allowing the cluster to scale up.
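As a quick illustration of the naming rule, a candidate key can be checked against the prefix before it is passed to agent-not-ready-taint-key. This is a minimal sketch: the helper name and the example suffix cilium-agent-not-ready are assumptions made for illustration, and the function only checks the prefix, not how any particular Cluster Autoscaler version behaves.

```python
CA_IGNORE_PREFIX = "ignore-taint.cluster-autoscaler.kubernetes.io/"
DEFAULT_KEY = "node.cilium.io/agent-not-ready"


def has_autoscaler_ignore_prefix(taint_key: str) -> bool:
    """True if the key starts with the prefix and has a non-empty name after it."""
    return taint_key.startswith(CA_IGNORE_PREFIX) and len(taint_key) > len(CA_IGNORE_PREFIX)


# Hypothetical custom key; the suffix is an illustrative choice.
custom_key = CA_IGNORE_PREFIX + "cilium-agent-not-ready"
```

The default key does not carry the prefix, which is why it may need to be replaced in the Cluster Autoscaler scenario described above.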
The taint’s effect should be chosen taking into account the following considerations:
If NoSchedule is used, pods won’t be scheduled to a node until Cilium has had the chance to remove the taint. However, one practical effect of this is that if some external process (such as a reboot) resets the CNI configuration on said node, pods that were already scheduled will be allowed to start concurrently with Cilium when the node next reboots, and hence may become unmanaged and have their networking managed by another CNI plugin.
If NoExecute is used, pods won’t be executed (nor scheduled) on a node until Cilium has had the chance to remove the taint. One practical effect of this is that whenever the taint is added back to the node by some external process (such as during an upgrade or a routine operation), pods will be evicted from the node until Cilium has had the chance to remove the taint.
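The difference between the two effects for pods that are already bound to a node can be sketched as follows. This is a simplified model, not Kubernetes code: the function name and pod names are hypothetical, and tolerations are reduced to a set of pod names assumed to tolerate the taint.

```python
def pods_allowed_to_run(effect, scheduled_pods, tolerating):
    """Return the already-scheduled pods that may run while the taint is present.

    NoSchedule only gates new scheduling decisions, so every pod already bound
    to the node may start alongside the agent.
    NoExecute also keeps non-tolerating pods from running (they are evicted).
    """
    if effect == "NoSchedule":
        return list(scheduled_pods)
    if effect == "NoExecute":
        return [p for p in scheduled_pods if p in tolerating]
    raise ValueError(f"unsupported effect: {effect}")


scheduled = ["cilium-agent", "app-a", "app-b"]  # pods already bound to the node
tolerating = {"cilium-agent"}                   # assumed to tolerate the taint

no_schedule = pods_allowed_to_run("NoSchedule", scheduled, tolerating)
no_execute = pods_allowed_to_run("NoExecute", scheduled, tolerating)
```

With NoSchedule, app-a and app-b can start before Cilium is ready and may become unmanaged; with NoExecute, only the tolerating agent pod runs until the taint is removed.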
Another important thing to consider is the concept of a node itself, and the
different points of view on what a node is. For example, the instance/VM that
backs a Kubernetes node can be patched or have its filesystem reset by a cloud
provider, or be replaced altogether with an entirely new instance/VM that comes
back with the same name as the already-existing Kubernetes Node resource. Even
though in these scenarios the node-pool-level taint will be added back to the
Node resource, pods that were already scheduled to the node with this name will
run on the node at the same time as Cilium, potentially becoming unmanaged. This
is why NoExecute is recommended: assuming the taint is added back in this
scenario, already-scheduled pods won’t run.
However, in some environments or cloud providers, and as mentioned above, a
taint established at the node-pool level may be added back to a node after
Cilium has removed it, for reasons other than a node upgrade/reset. The exact
circumstances in which this happens vary, but it may lead to
unexpected/undesired pod evictions when NoExecute is being used as the taint
effect. It is therefore recommended that, for each deployment and depending on
the environment or cloud provider, a careful decision is made regarding the
taint effect (or even regarding whether to use the taint-based approach at all).
This decision should be based on the information above, on the environment or
cloud provider’s documentation, and on the fact that one is essentially
establishing a trade-off between having unmanaged pods in the cluster (which can
lead to dropped traffic and other issues) and having unexpected/undesired
evictions (which can lead to application downtime).
Taking into account all of the above, throughout the Cilium documentation we
recommend using NoExecute, as we believe it to be the least disruptive mode
that users can use to deploy Cilium on cloud providers.