"managed-system-upgrade-controller" cattle feature is enabled by default in the Rancher chart and interferes with CAPI node rolling updates
By analyzing multiple CI runs stuck in workload-cluster upgrades (ex: https://gitlab.com/sylva-projects/sylva-core/-/jobs/10355269367), we noticed that the cattle-system/system-upgrade-controller pod does not seem to be evicted properly from machines being deleted.
It appears to be related to its tolerations, but upon further investigation, this Rancher feature seems unnecessary for our use case: we don't need to upgrade clusters/nodes via this mechanism, since the Sylva stack drives upgrades itself.
Upstream documentation:
managed-system-upgrade-controller: Enables the installation of the system-upgrade-controller app in downstream RKE2/K3s clusters, currently limited to imported clusters and the local cluster, with plans to expand support to node-driver clusters.
https://docs.rke2.io/upgrades/automated_upgrade
https://docs.k3s.io/upgrades/automated
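For context, the mechanism this feature enables is driven by Plan objects; the example below is adapted from the RKE2 docs linked above, and illustrates the kind of upgrade workflow that Sylva does not use, since node upgrades are driven by CAPI instead:

```yaml
# Illustrative system-upgrade-controller Plan (adapted from the RKE2
# automated-upgrade documentation) -- not an object present in our clusters.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade, operator: Exists}
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.30.11+rke2r1
```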
Full analysis of a CI run:
Error message during workload-cluster upgrade:
Timed-out waiting for the following resources to be ready:
IDENTIFIER STATUS REASON MESSAGE
Kustomization/wc-41732558-rke2-capm3-xxx/cluster InProgress Kustomization generation is 2, but latest observed generation is 1
╰┄╴HelmRelease/wc-41732558-rke2-capm3-xxx/cluster Ready Resource is Ready
├┄╴Cluster/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx InProgress Rolling 3 replicas with outdated spec (0 replicas up to date)
┆ ╰┄╴RKE2ControlPlane/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane InProgress Rolling 3 replicas with outdated spec (0 replicas up to date)
┆ ╰┄╴Machine/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane-qkkqc Terminating Resource scheduled for deletion
┆ ╰┄╴┬┄┄[Conditions]
┆ ├┄╴Ready False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
┆ ┆ * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
┆ ├┄╴AgentHealthy True
┆ ├┄╴BootstrapReady True
┆ ├┄╴DrainingSucceeded False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
┆ ┆ * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
┆ ├┄╴EtcdMemberHealthy False Deleting
┆ ├┄╴InfrastructureReady True
┆ ├┄╴NodeHealthy True
┆ ├┄╴NodeMetadataUpToDate True
┆ ╰┄╴PreDrainDeleteHookSucceeded True
╰┄╴RKE2ControlPlane/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane InProgress Rolling 3 replicas with outdated spec (0 replicas up to date)
╰┄╴Machine/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane-qkkqc Terminating Resource scheduled for deletion
╰┄╴┬┄┄[Conditions]
├┄╴Ready False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
┆ * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
├┄╴AgentHealthy True
├┄╴BootstrapReady True
├┄╴DrainingSucceeded False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
┆ * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
├┄╴EtcdMemberHealthy False Deleting
├┄╴InfrastructureReady True
├┄╴NodeHealthy True
├┄╴NodeMetadataUpToDate True
╰┄╴PreDrainDeleteHookSucceeded
Machine state:
wc-42155154-rke2-capm3-xxx wc-42155154-rke2-capm3-xxx-control-plane-rqwcg wc-42155154-rke2-capm3-xxx wc-42155154-rke2-capm3-xxx-dl360-36 metal3://wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-lannion-dl360-36/wc-42155154-rke2-capm3-xxx-cp-1cc1725239-9c4zf Deleting 6h42m v1.30.11+rke2r1
CAPI logs on the management cluster:
I0616 00:11:03.458771 1 machine_controller.go:911] "Drain not completed yet, requeuing in 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="0f9874b9-3dd0-4164-9497-59b2599df77c" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36" podsFailedEviction="" podsWithDeletionTimestamp="cattle-system/system-upgrade-controller-5594ffbb7c-777mn" podsToTriggerEvictionLater="" podsToWaitCompletedNow="" podsToWaitCompletedLater=""
I0616 00:11:23.536198 1 machine_controller.go:891] "Draining Node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="626ee27c-c46e-4239-94c8-212f037063e5" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36"
I0616 00:11:23.536327 1 drain.go:308] "Drain not completed yet, there are still Pods on the Node that have to be drained" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="626ee27c-c46e-4239-94c8-212f037063e5" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36" podsToTriggerEvictionNow="cattle-system/system-upgrade-controller-5594ffbb7c-t7srs" podsToTriggerEvictionLater="" podsWithDeletionTimestamp="" podsToWaitCompletedNow="" podsToWaitCompletedLater=""
The pod cattle-system/system-upgrade-controller-5594ffbb7c-t7srs seems to be the culprit.
Pods on the workload cluster (grep system-upgrade):
cattle-system system-upgrade-controller-5594ffbb7c-b5275 0/1 ContainerCreating 0 0s <none> wc-42155154-rke2-capm3-xxx-dl360-36 <none> <none>
cattle-system system-upgrade-controller-5594ffbb7c-skrdh 1/1 Terminating 0 20s 100.72.213.246 wc-42155154-rke2-capm3-xxx-dl360-36 <none> <none>
It seems that the pods are rescheduled on dl360-36 despite the fact that the machine is being deleted.
Node taints:
taints:
- effect: NoSchedule
key: node.kubernetes.io/unschedulable <<<<<<<
timeAdded: "2025-06-15T21:15:23Z"
unschedulable: true
status:
addresses:
- address: 172.20.36.187
type: InternalIP
- address: wc-42155154-rke2-capm3-xxx-dl360-36
type: Hostname
The node is indeed unschedulable.
Pod spec (excerpt) of the system-upgrade-controller pod, showing its tolerations:
nodeName: wc-42155154-rke2-capm3-xxx-dl360-36
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: system-upgrade-controller
serviceAccountName: system-upgrade-controller
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists <<<<<<<<<<<<<<<<<
This means that the pod can be scheduled on any node, even a tainted one: a bare "Exists" toleration with no key matches every taint, including node.kubernetes.io/unschedulable, so the scheduler happily places the pod back on the draining node.
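The matching rule can be sketched in Python (a simplified model of the Kubernetes taint/toleration logic, not actual scheduler or Rancher code; the `scoped` toleration is a hypothetical contrast):

```python
# Simplified model of Kubernetes taint/toleration matching, to show why
# a bare `operator: Exists` toleration matches every taint.

def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect matches any effect; otherwise they must be equal.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key with operator Exists tolerates every taint.
    if not toleration.get("key"):
        return toleration.get("operator") == "Exists"
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator") == "Exists":
        return True
    # Default operator is Equal: values must match.
    return toleration.get("value") == taint.get("value")

unschedulable_taint = {
    "key": "node.kubernetes.io/unschedulable",
    "effect": "NoSchedule",
}

# Toleration found on the system-upgrade-controller pod:
blanket = {"operator": "Exists"}
# A hypothetical scoped toleration that would NOT match this taint:
scoped = {"key": "CriticalAddonsOnly", "operator": "Exists"}

print(tolerates(blanket, unschedulable_taint))  # True
print(tolerates(scoped, unschedulable_taint))   # False
```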
ReplicaSet cattle-cluster-agent (excerpt):
containers:
- env:
- name: CATTLE_FEATURES
value: embedded-cluster-api=false,fleet=false,managed-system-upgrade-controller=true,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningprebootstrap=false,provisioningv2=false,rke2=false,ui-sql-cache=false
- name: CATTLE_IS_RKE
value: "false"
- name: CATTLE_SERVER
value: https://rancher.sylva
- name: CATTLE_CA_CHECKSUM
value: ada1264cb0631611a5ae230fe4c9b599c10761ed065559fa8481523298a41a78
- name: CATTLE_CLUSTER
value: "true"
- name: CATTLE_K8S_MANAGED
value: "true"
- name: CATTLE_CLUSTER_REGISTRY
value: registry.rancher.com
- name: CATTLE_CREDENTIAL_NAME
value: cattle-credentials-52ad19cf26
- name: CATTLE_SERVER_VERSION
value: v2.10.6
- name: CATTLE_INSTALL_UUID
value: c04fd186-66b2-4bb5-90f5-dd0aaecd5498
- name: CATTLE_INGRESS_IP_DOMAIN
value: sslip.io
- name: STRICT_VERIFY
value: "true"
image: registry.rancher.com/rancher/rancher-agent:v2.10.6
imagePullPolicy: IfNotPresent
name: cluster-register
The managed-system-upgrade-controller=true feature flag is interesting, as it appears to control whether the system-upgrade-controller deployment is installed in the downstream cluster.
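Assuming this flag can be turned off the same way other Rancher feature flags are, a sketch of the override via the Rancher Helm chart's `features` value (to be verified against the Rancher chart documentation):

```yaml
# Hypothetical Rancher Helm chart values override: the chart exposes a
# comma-separated `features` list that feeds CATTLE_FEATURES; disabling
# the flag here should prevent the system-upgrade-controller app from
# being installed in downstream clusters.
features: "managed-system-upgrade-controller=false"
```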