Don't let Rancher fleet-agent pods interfere with CAPI node rolling updates
Closes #3182 (closed)
This MR introduces a Kyverno policy to set the CAPI drain-skip label on the pods of the Rancher fleet-agent Deployment.
#3182 (closed) arises because something in Rancher Fleet adds a toleration for the `node.kubernetes.io/unschedulable:NoSchedule` taint to fleet-agent pods. At cordon/drain time, this toleration allows the Pod to be rescheduled onto the very node being drained, preventing the drain from completing.
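For reference, the problematic toleration looks like the following (reconstructed from the taint name discussed in the issue; the exact `operator`/`tolerationSeconds` fields on the actual pods may differ):

```yaml
# Toleration observed on fleet-agent pods: it lets the pod be scheduled
# onto nodes that are cordoned (tainted node.kubernetes.io/unschedulable).
tolerations:
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule
```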
I couldn't easily find a way to prevent this, so this MR addresses the problem from the other side: it ensures that Cluster API will not try to drain these pods, which is possible by labelling them with `cluster.x-k8s.io/drain: skip`. This MR enforces that label by introducing a Kyverno policy.
Note well:

- I tried to find what sets the `node.kubernetes.io/unschedulable:NoSchedule` toleration on fleet-agent pods, but couldn't: fleet-agent is deployed by fleet-controller using a Fleet Bundle which includes manifests which are then deployed by fleet-controller... but the `node.kubernetes.io/unschedulable:NoSchedule` toleration isn't added there (some other component adding the toleration later?)
- a Kyverno policy is the only way to set the `cluster.x-k8s.io/drain: skip` label -- see above, the code defining the fleet-agent Deployment is deep in Fleet code with most things hardcoded (https://github.com/rancher/fleet/blob/04589b2e8a2dfea423b2c8d835a94c3a7fc4392e/internal/cmd/controller/agentmanagement/agent/manifest.go#L143)
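A mutating Kyverno policy for this can be sketched as follows (policy and rule names here are illustrative, not necessarily those used in this MR; the namespace and `app` label are taken from the pod excerpt in the Test section below):

```yaml
# Sketch of a Kyverno ClusterPolicy that adds the CAPI drain-skip label
# to fleet-agent pods, so Cluster API ignores them during node drains.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: fleet-agent-drain-skip   # illustrative name
spec:
  rules:
    - name: add-drain-skip-label
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - cattle-fleet-local-system
              selector:
                matchLabels:
                  app: fleet-agent
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              cluster.x-k8s.io/drain: skip
```

Mutating pods directly (rather than the Deployment) has the advantage of working regardless of how Fleet generates or later reconciles the Deployment spec.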
I filed #3183 so that we work on removing fleet-agent in the mgmt cluster, and I'm trying to implement it (!6207), but this MR will remain relevant even after that is addressed, since we want Sylva to be robust including in a scenario where a workload cluster would be enrolled in Fleet and would receive the fleet-agent.
Test
We can see in the CI job artifacts that the fleet-agent-xxx pods in the cattle-fleet-local-system namespace have the desired label:
```yaml
  annotations:
    cni.projectcalico.org/containerID: e0122795ac69fa7d9d4f9bd8fa6c08fe5de740db120b3bc71c81f1aeadf21f12
    cni.projectcalico.org/podIP: 100.72.18.224/32
    cni.projectcalico.org/podIPs: 100.72.18.224/32
  creationTimestamp: "2025-11-28T06:49:56Z"
  generateName: fleet-agent-7b978c48c9-
  generation: 1
  labels:
    app: fleet-agent
    cluster.x-k8s.io/drain: skip   # <<< the label enforced by the new policy
    pod-template-hash: 7b978c48c9
```
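Outside of the CI artifacts, the same check can be done directly against the cluster with something like the following (namespace and `app` label taken from the excerpt above; this command is illustrative and not part of the MR):

```shell
# List fleet-agent pods and the value of their CAPI drain label
# (dots in the label key must be escaped in custom-columns expressions).
kubectl get pods -n cattle-fleet-local-system -l app=fleet-agent \
  -o custom-columns='NAME:.metadata.name,DRAIN:.metadata.labels.cluster\.x-k8s\.io/drain'
```

Each listed pod should show `skip` in the `DRAIN` column once the Kyverno policy is active.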
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open the CI configuration
Legend:

| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name; by default applies to both management and workload cluster. Can be prefixed by mgmt: or wkld: to be applied only to a specific cluster type |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
- ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse
- ☁️ capo 🚀 rke2 🐧 leapmicro
- ☁️ capo 🚀 kadm 🐧 ubuntu
- ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor
- ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capo 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🐧 ubuntu
- ☁️ capm3 🚀 ck8s 🐧 ubuntu
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:

- Enabling `autorun` will make deployment pipelines run automatically without human interaction
- Disabling `allow failure` will make deployment pipelines mandatory for pipeline success
- If both `autorun` and `allow failure` are disabled, deployment pipelines will need manual triggering but will block the pipeline
Be aware: after a configuration change, the pipeline is not triggered automatically.
Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.