fleet-agent pod resurrected after eviction, because it tolerates the cordon taint
This is a follow-up to https://gitlab.com/sylva-projects/sylva-core/-/work_items/3182 — the same problem still exists. The following was observed in the context of !5900, but is not specific to the changes brought by that MR.

We now have the root cause: when installing the fleet-agent on the mgmt cluster, Rancher gives the fleet-agent pods a toleration for every taint found on control-plane nodes. The code is in [addCpTaintsToTolerations](https://github.com/snasovich/rancher/blob/0af695cf8305b7a351d42dacac83fb65c5de10c0/pkg/controllers/provisioningv2/fleetcluster/fleetcluster.go#L107), and this is regularly reapplied by a reconciliation loop. On a Node drain, the `NoSchedule node.kubernetes.io/unschedulable` taint is set on the Node (for cordoning), so the fleet-agent pods end up resistant to cordoning. As a result:

* if we don't prevent these pods from being drained, each evicted pod is instantly recreated, possibly on the same Node
* if we do prevent these pods from being drained (as was done in !6212), the pods remain on the CP Node; because that CP Node is being torn down, the fleet-agent (at least sometimes) suffers side effects (we observed failures to get a lease from the k8s API), and because the fleet-controller checks the agents' health and fully recreates the fleet-agent Deployment, we again end up with a recreation loop

The best fix would be to fully remove the fleet agent from the mgmt cluster (see #3183+). But in the meantime we need to ensure that it does not get the toleration to the cordon taint.