Update rancher-webhook policy to fix drain stuck during workload cluster upgrade
What does this MR do and why?
During workload cluster upgrades, node drain gets stuck due to the following error:
Pod cattle-system/rancher-webhook-xxxx: cannot evict pod as it would violate the pod's disruption budget. The disruption budget rancher-webhook-pdb needs 1 healthy pod and has 1 currently.
Analysis shows that the Rancher webhook Deployment (rancher-webhook) was created with only 1 replica, while its PDB requires 1 minAvailable, blocking eviction during upgrade.
Root cause:
Before upgrade, Rancher webhook (v0.6.4) deployed with replicas: 2.
During upgrade, Rancher webhook (v0.8.3) recreated with replicas: 1.
The Kyverno policy that enforces replicas: 2 was applied after the Deployment was created.
Since the policy did not include mutateExistingOnPolicyUpdate: true, Kyverno’s background controller skipped patching the existing Deployment (seen as “empty resource to patch” in logs).
As a result, replicas remained 1, causing PDB violation and drain loop.
Fix:
Add mutateExistingOnPolicyUpdate: true to the rancher-webhook-replicas rule in the Kyverno policy to ensure mutation applies to existing Deployments when the policy is (re)applied.
Related reference(s)
closes #3191 (closed)
Test coverage
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open to CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| Infra Provider |
capd, capo, capm3
|
|
| Bootstrap Provider |
kubeadm (alias kadm), rke2, okd, ck8s
|
|
| Node OS |
ubuntu, suse, na, leapmicro
|
|
| Deployment Options |
light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium
|
|
| Pipeline Scenarios | Available scenario list and description | |
| Enabled units | Any available units name, by default apply to management and workload cluster. Can be prefixed by mgmt: or wkld: to be applied only to a specific cluster type |
-
🎬 preview☁️ capd🚀 kadm🐧 ubuntu -
🎬 preview☁️ capo🚀 rke2🐧 suse -
🎬 preview☁️ capm3🚀 rke2🐧 ubuntu -
☁️ capd🚀 kadm🛠️ light-deploy🐧 ubuntu -
☁️ capd🚀 rke2🛠️ light-deploy🐧 suse -
☁️ capo🚀 rke2🐧 suse -
☁️ capo🚀 rke2🐧 leapmicro -
☁️ capo🚀 kadm🐧 ubuntu -
☁️ capo🚀 kadm🐧 ubuntu🟢 neuvector,mgmt:harbor -
☁️ capo🚀 rke2🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capo🚀 kadm🎬 wkld-k8s-upgrade🐧 ubuntu -
☁️ capo🚀 rke2🎬 rolling-update-no-wkld🛠️ ha🐧 suse -
☁️ capo🚀 rke2🎬 sylva-upgrade-from-1.5.x🛠️ ha🐧 ubuntu -
☁️ capo🚀 rke2🎬 sylva-upgrade-from-1.5.x🛠️ ha,misc🐧 ubuntu -
☁️ capo🚀 rke2🛠️ ha,misc🐧 ubuntu -
☁️ capo🚀 rke2🛠️ ha,misc,openbao🐧 suse -
☁️ capo🚀 rke2🐧 suse🎬 upgrade-from-prev-tag -
☁️ capm3🚀 rke2🐧 suse -
☁️ capm3🚀 kadm🐧 ubuntu -
☁️ capm3🚀 ck8s🐧 ubuntu -
☁️ capm3🚀 kadm🎬 rolling-update-no-wkld🛠️ ha,misc🐧 ubuntu -
☁️ capm3🚀 rke2🎬 wkld-k8s-upgrade🛠️ ha🐧 suse -
☁️ capm3🚀 kadm🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capm3🚀 rke2🎬 sylva-upgrade-from-1.5.x🛠️ ha🐧 suse -
☁️ capm3🚀 rke2🛠️ misc,ha🐧 suse -
☁️ capm3🚀 rke2🎬 sylva-upgrade-from-1.5.x🛠️ ha,misc🐧 suse -
☁️ capm3🚀 kadm🎬 rolling-update🛠️ ha🐧 suse -
☁️ capm3🚀 ck8s🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capm3🚀 rke2|okd🎬 no-update🐧 ubuntu|na -
☁️ capm3🚀 rke2🐧 suse🎬 upgrade-from-release-1.5 -
☁️ capm3🚀 rke2🐧 suse🎬 upgrade-to-main
Global config for deployment pipelines
-
autorun pipelines -
allow failure on pipelines -
record sylvactl events
Notes:
- Enabling
autorunwill make deployment pipelines to be run automatically without human interaction - Disabling
allow failurewill make deployment pipelines mandatory for pipeline success. - if both
autorunandallow failureare disabled, deployment pipelines will need manual triggering but will be blocking the pipeline
Be aware: after configuration change, pipeline is not triggered automatically.
Please run it manually (by clicking the run pipeline button in Pipelines tab) or push new code.