"managed-system-upgrade-controller" cattle_feature is enabled by default in Rancher chart, interferes with CAPI node rolling update

By analyzing multiple CI runs stuck in workload-cluster upgrades (e.g. https://gitlab.com/sylva-projects/sylva-core/-/jobs/10355269367), we noticed that the cattle-system/system-upgrade-controller pods do not seem to be evacuated properly from machines being deleted.

It appears to be related to its tolerations, but upon further investigation, it seems that this Rancher feature is unnecessary for our use case (we don't need to upgrade clusters/nodes via this mechanism due to our Sylva stack).

Upstream documentation:

managed-system-upgrade-controller: Enables the installation of the system-upgrade-controller app in downstream RKE2/K3s clusters, currently limited to imported clusters and the local cluster, with plans to expand support to node-driver clusters.

https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-references/feature-flags

https://docs.rke2.io/upgrades/automated_upgrade

https://docs.k3s.io/upgrades/automated


Full analysis of the CI run:

Error message during workload-cluster upgrade :

Timed-out waiting for the following resources to be ready:
IDENTIFIER                                                                                                                                                                                    STATUS      REASON   MESSAGE
Kustomization/wc-41732558-rke2-capm3-xxx/cluster                                                                                                                                          InProgress           Kustomization generation is 2, but latest observed generation is 1
╰┄╴HelmRelease/wc-41732558-rke2-capm3-xxx/cluster                                                                                                                                         Ready                Resource is Ready
   ├┄╴Cluster/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx                                                                                                                   InProgress           Rolling 3 replicas with outdated spec (0 replicas up to date)
   ┆  ╰┄╴RKE2ControlPlane/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane                                                                                         InProgress           Rolling 3 replicas with outdated spec (0 replicas up to date)
   ┆     ╰┄╴Machine/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane-qkkqc                                                                                         Terminating          Resource scheduled for deletion
   ┆        ╰┄╴┬┄┄[Conditions]
   ┆           ├┄╴Ready                                                                                                                                                                       False       Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
   ┆           ┆  * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
   ┆           ├┄╴AgentHealthy                                                                                                                                                                True
   ┆           ├┄╴BootstrapReady                                                                                                                                                              True
   ┆           ├┄╴DrainingSucceeded                                                                                                                                                           False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
   ┆           ┆  * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
   ┆           ├┄╴EtcdMemberHealthy                                                                                                                                                           False       Deleting
   ┆           ├┄╴InfrastructureReady                                                                                                                                                         True
   ┆           ├┄╴NodeHealthy                                                                                                                                                                 True
   ┆           ├┄╴NodeMetadataUpToDate                                                                                                                                                        True
   ┆           ╰┄╴PreDrainDeleteHookSucceeded                                                                                                                                                 True
   ╰┄╴RKE2ControlPlane/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane                                                                                            InProgress           Rolling 3 replicas with outdated spec (0 replicas up to date)
      ╰┄╴Machine/wc-41732558-rke2-capm3-xxx/wc-41732558-rke2-capm3-xxx-control-plane-qkkqc                                                                                            Terminating          Resource scheduled for deletion
         ╰┄╴┬┄┄[Conditions]
            ├┄╴Ready                                                                                                                                                                          False       Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
            ┆  * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
            ├┄╴AgentHealthy                                                                                                                                                                   True
            ├┄╴BootstrapReady                                                                                                                                                                 True
            ├┄╴DrainingSucceeded                                                                                                                                                              False Draining Drain not completed yet (started at 2025-06-05T21:30:38Z):
            ┆  * Pods cattle-system/rancher-webhook-54cf7bc6cd-ddh22, cattle-system/system-upgrade-controller-584895cdb9-hpnnl: deletionTimestamp set, but still not removed from the Node
            ├┄╴EtcdMemberHealthy                                                                                                                                                              False Deleting
            ├┄╴InfrastructureReady                                                                                                                                                            True
            ├┄╴NodeHealthy                                                                                                                                                                    True
            ├┄╴NodeMetadataUpToDate                                                                                                                                                           True
            ╰┄╴PreDrainDeleteHookSucceeded  

Machine state :

wc-42155154-rke2-capm3-xxx   wc-42155154-rke2-capm3-xxx-control-plane-rqwcg     wc-42155154-rke2-capm3-xxx     wc-42155154-rke2-capm3-xxx-dl360-36     metal3://wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-lannion-dl360-36/wc-42155154-rke2-capm3-xxx-cp-1cc1725239-9c4zf   Deleting   6h42m   v1.30.11+rke2r1

CAPI logs on the management cluster:

I0616 00:11:03.458771       1 machine_controller.go:911] "Drain not completed yet, requeuing in 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="0f9874b9-3dd0-4164-9497-59b2599df77c" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36" podsFailedEviction="" podsWithDeletionTimestamp="cattle-system/system-upgrade-controller-5594ffbb7c-777mn" podsToTriggerEvictionLater="" podsToWaitCompletedNow="" podsToWaitCompletedLater=""
I0616 00:11:23.536198       1 machine_controller.go:891] "Draining Node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="626ee27c-c46e-4239-94c8-212f037063e5" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36"
I0616 00:11:23.536327       1 drain.go:308] "Drain not completed yet, there are still Pods on the Node that have to be drained" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" namespace="wc-42155154-rke2-capm3-xxx" name="wc-42155154-rke2-capm3-xxx-control-plane-rqwcg" reconcileID="626ee27c-c46e-4239-94c8-212f037063e5" Cluster="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx" RKE2ControlPlane="wc-42155154-rke2-capm3-xxx/wc-42155154-rke2-capm3-xxx-control-plane" Node="wc-42155154-rke2-capm3-xxx-dl360-36" podsToTriggerEvictionNow="cattle-system/system-upgrade-controller-5594ffbb7c-t7srs" podsToTriggerEvictionLater="" podsWithDeletionTimestamp="" podsToWaitCompletedNow="" podsToWaitCompletedLater=""

The pod cattle-system/system-upgrade-controller-5594ffbb7c-t7srs seems to be the culprit.

Pods on the workload cluster (grep system-upgrade):

cattle-system              system-upgrade-controller-5594ffbb7c-b5275                         0/1     ContainerCreating   0          0s      <none>           wc-42155154-rke2-capm3-xxx-dl360-36   <none>           <none>
cattle-system              system-upgrade-controller-5594ffbb7c-skrdh                         1/1     Terminating         0          20s     100.72.213.246   wc-42155154-rke2-capm3-xxx-dl360-36   <none>           <none>

It seems that the pods are rescheduled on dl360-36 despite the fact that the machine is being deleted.

Node taints:

    taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable   <<<<<<<
      timeAdded: "2025-06-15T21:15:23Z"
    unschedulable: true
  status:
    addresses:
    - address: 172.20.36.187
      type: InternalIP
    - address: wc-42155154-rke2-capm3-xxx-dl360-36
      type: Hostname

The node is unschedulable

Tolerations of system-upgrade-controller pod:

    nodeName: wc-42155154-rke2-capm3-xxx-dl360-36
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: system-upgrade-controller
    serviceAccountName: system-upgrade-controller
    terminationGracePeriodSeconds: 30
    tolerations:
    - operator: Exists <<<<<<<<<<<<<<<<<

A toleration with no key and operator "Exists" matches every taint. The pod therefore tolerates node.kubernetes.io/unschedulable as well and can be scheduled on any node, including one that has been cordoned for draining.
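
To make the matching rule concrete, here is a minimal Python sketch of how a toleration is compared against a taint, following the rules described in the Kubernetes taints and tolerations documentation. The Taint and Toleration classes and the tolerates function are hypothetical names used for illustration only; this is a simplified model, not the scheduler's actual code.

```python
from dataclasses import dataclass


@dataclass
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass
class Toleration:
    # Empty key / effect act as wildcards, as in the Kubernetes API.
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified model of toleration matching (illustrative only)."""
    # A non-empty effect on the toleration must match the taint's effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # A non-empty key on the toleration must match the taint's key.
    if tol.key and tol.key != taint.key:
        return False
    # "Exists" ignores the value; "Equal" requires the values to match.
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


# Taint observed on the cordoned node in this issue.
cordon_taint = Taint(key="node.kubernetes.io/unschedulable", effect="NoSchedule")

# Toleration observed on the system-upgrade-controller pod: "- operator: Exists".
catch_all = Toleration(operator="Exists")

# A toleration restricted to another (made-up) key, for comparison.
other_key = Toleration(key="example.com/other", operator="Exists")

print(tolerates(catch_all, cordon_taint))
print(tolerates(other_key, cordon_taint))
```

With the catch-all toleration the cordon taint is tolerated, which is consistent with the replacement pod landing again on the node being drained.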

cattle-cluster-agent ReplicaSet:

        containers:
        - env:
          - name: CATTLE_FEATURES
            value: embedded-cluster-api=false,fleet=false,managed-system-upgrade-controller=true,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningprebootstrap=false,provisioningv2=false,rke2=false,ui-sql-cache=false
          - name: CATTLE_IS_RKE
            value: "false"
          - name: CATTLE_SERVER
            value: https://rancher.sylva
          - name: CATTLE_CA_CHECKSUM
            value: ada1264cb0631611a5ae230fe4c9b599c10761ed065559fa8481523298a41a78
          - name: CATTLE_CLUSTER
            value: "true"
          - name: CATTLE_K8S_MANAGED
            value: "true"
          - name: CATTLE_CLUSTER_REGISTRY
            value: registry.rancher.com
          - name: CATTLE_CREDENTIAL_NAME
            value: cattle-credentials-52ad19cf26
          - name: CATTLE_SERVER_VERSION
            value: v2.10.6
          - name: CATTLE_INSTALL_UUID
            value: c04fd186-66b2-4bb5-90f5-dd0aaecd5498
          - name: CATTLE_INGRESS_IP_DOMAIN
            value: sslip.io
          - name: STRICT_VERIFY
            value: "true"
          image: registry.rancher.com/rancher/rancher-agent:v2.10.6
          imagePullPolicy: IfNotPresent
          name: cluster-register

The feature managed-system-upgrade-controller=true is interesting as it seems to control the activation of the system-upgrade-controller deployment.
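
As a quick way to check the flag on other clusters, the CATTLE_FEATURES value can be split into name/value pairs. The snippet below is a small sketch using the exact string from the ReplicaSet above; parse_features is a hypothetical helper written for this issue, not Rancher's own parser.

```python
def parse_features(raw: str) -> dict[str, bool]:
    """Split a comma-separated 'name=true|false' list into a dict."""
    flags = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        flags[name.strip()] = value.strip().lower() == "true"
    return flags


# Value of CATTLE_FEATURES copied from the cattle-cluster-agent ReplicaSet.
raw = (
    "embedded-cluster-api=false,fleet=false,"
    "managed-system-upgrade-controller=true,"
    "multi-cluster-management=false,multi-cluster-management-agent=true,"
    "provisioningprebootstrap=false,provisioningv2=false,rke2=false,"
    "ui-sql-cache=false"
)

flags = parse_features(raw)
print(flags["managed-system-upgrade-controller"])
```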
