Update rancher-webhook policy to fix drain stuck during workload cluster upgrade (!6223) · Merge requests · Sylva-projects / sylva-core

What does this MR do and why?

During workload cluster upgrades, node drain gets stuck due to the following error:

Pod cattle-system/rancher-webhook-xxxx: cannot evict pod as it would violate the pod's disruption budget. The disruption budget rancher-webhook-pdb needs 1 healthy pod and has 1 currently.

Analysis shows that the Rancher webhook Deployment (rancher-webhook) was created with only 1 replica, while its PDB sets minAvailable: 1. With a single healthy pod, no disruption is allowed, so the eviction is refused and the drain blocks during upgrade.
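
To make the arithmetic concrete, here is a minimal sketch of what such a PDB could look like. The name and namespace are taken from the error message above; the selector label is an assumption, since the actual manifest is not shown in this MR.

```yaml
# Sketch only: the selector label below is assumed, not copied from the chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rancher-webhook-pdb
  namespace: cattle-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rancher-webhook   # assumed label
# Allowed disruptions = healthy pods - minAvailable
#   replicas: 1  ->  1 - 1 = 0  (eviction refused, drain blocks)
#   replicas: 2  ->  2 - 1 = 1  (one pod may be evicted at a time)
```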

Root cause:

Before the upgrade, Rancher webhook (v0.6.4) was deployed with replicas: 2.

During the upgrade, Rancher webhook (v0.8.3) was recreated with replicas: 1.

The Kyverno policy that enforces replicas: 2 was applied after the Deployment was created.

Since the policy did not include mutateExistingOnPolicyUpdate: true, Kyverno’s background controller skipped patching the existing Deployment (seen as “empty resource to patch” in logs).

As a result, replicas remained at 1, which caused the PDB violation and the drain loop.

Fix:

Add mutateExistingOnPolicyUpdate: true to the rancher-webhook-replicas rule in the Kyverno policy to ensure mutation applies to existing Deployments when the policy is (re)applied.
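
For illustration, a rule of this kind might look roughly like the sketch below. Only the rule name, the target Deployment, replicas: 2, and mutateExistingOnPolicyUpdate: true come from this description; the policy name, the match block, and the patch layout are assumptions and may differ from the actual policy in sylva-core. The placement of mutateExistingOnPolicyUpdate under the rule's mutate block matches recent Kyverno releases; older releases expect it at the policy spec level instead.

```yaml
# Sketch only: not the actual sylva-core policy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: rancher-webhook-replicas-example   # hypothetical policy name
spec:
  rules:
    - name: rancher-webhook-replicas
      match:                               # assumed match block
        any:
          - resources:
              kinds:
                - Deployment
              names:
                - rancher-webhook
              namespaces:
                - cattle-system
      mutate:
        # Also patch Deployments that already existed when the
        # policy was created or updated.
        mutateExistingOnPolicyUpdate: true
        # Mutating existing resources needs explicit targets.
        targets:
          - apiVersion: apps/v1
            kind: Deployment
            name: rancher-webhook
            namespace: cattle-system
        patchStrategicMerge:
          spec:
            replicas: 2
```

With a rule shaped like this, re-applying the policy should trigger the background controller to patch the existing Deployment rather than only mutating Deployments at admission time.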

closes #3191 (closed)

Test coverage

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Legend:

Icon	Meaning	Available values
☁️	Infra Provider	`capd`, `capo`, `capm3`
🚀	Bootstrap Provider	`kubeadm` (alias `kadm`), `rke2`, `okd`, `ck8s`
🐧	Node OS	`ubuntu`, `suse`, `na`, `leapmicro`
🛠️	Deployment Options	`light-deploy`, `dev-sources`, `ha`, `misc`, `maxsurge-0`, `logging`, `no-logging`, `cilium`
🎬	Pipeline Scenarios	Available scenario list and description
🟢	Enabled units	Any available unit name. By default it applies to both management and workload clusters. Prefix it with `mgmt:` or `wkld:` to apply it to only one cluster type

Global config for deployment pipelines

autorun pipelines
allow failure on pipelines
record sylvactl events

Notes:

Enabling autorun makes deployment pipelines run automatically, without human interaction.
Disabling allow failure makes deployment pipelines mandatory for pipeline success.
If both autorun and allow failure are disabled, deployment pipelines need manual triggering but still block the pipeline.

Be aware: after a configuration change, the pipeline is not triggered automatically. Run it manually (by clicking the run pipeline button in the Pipelines tab) or push new code.
