rewrite cluster-nodes-provider-id-blacklist Kyverno policy to use ValidatingAdmissionPolicy to prevent Node recreation
This is a follow-up to !3811 (closed).
Closes #2242 (closed)
Context
In !3811 (closed) (see !3811 (comment 2412544687) in particular) we disabled the cluster-nodes-provider-id-blacklist Kyverno policy for kubeadm, because for the 1.30 to 1.31 upgrade,
in the early phases of the setup of a node, kubelet only talks to the local apiserver, which cannot reach the Kyverno webhook; even with failurePolicy: Ignore set on the webhook, this has side-effects (failure to set the control plane role label on the Node).
What this mechanism does (reminder)
With capm3, during node rolling updates we reuse baremetal servers and keep the same Node names (required for Longhorn to find its data).
We need to guard against a corner case though: when a server has been drained, before the machine is actually deleted, kubelet may restart and, at least with RKE2, we know that the Node resource could be recreated, in which case it would interfere with the creation of the same Node for the new machine replacing it.
We do this with Kyverno, by ensuring that no Node can be created reusing the spec.providerID of any currently existing Node.
Today this is done with:
- a first policy A that populates a ConfigMap keeping track of providerIDs (a rough sketch of this ConfigMap is shown after this list)
- a second policy B that prevents the creation/update of a Node setting spec.providerID to one of the old providerIDs tracked in the ConfigMap
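
For illustration, here is what the tracking ConfigMap could look like. The ConfigMap name, the providerID and the Node name are taken from the test steps and kubelet log below; the namespace and the key/value layout (providerID as key, Node name as value) are assumptions for this sketch and may differ from the actual implementation.

```yaml
# Hypothetical snapshot of the providerID-tracking ConfigMap (sketch only).
apiVersion: v1
kind: ConfigMap
metadata:
  name: nodes-provider-ids
  namespace: kyverno          # assumption: namespace where the ConfigMap lives
data:
  # blacklisted providerID -> name of the drained Node it belonged to
  9a8397a0-f2b1-4bda-b769-4754c22cfba4: management-cluster-md0-cv86x-ntmz2
```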
What this MR does
This MR replaces the second policy B with a ValidatingAdmissionPolicy; this policy is applied directly by the apiserver without involving any webhook.
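
For reference, here is a minimal sketch of what such a ValidatingAdmissionPolicy and its binding could look like. The policy/binding names, the ConfigMap name and the denial message are taken from the test steps and kubelet log below; the CEL expression, the namespace and the assumption that providerIDs are stored as keys of the ConfigMap data are illustrative, so the actual manifests in this MR may differ.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: nodes-provider-id-blacklist
spec:
  failurePolicy: Fail               # enforced by the apiserver itself, no webhook involved
  paramKind:
    apiVersion: v1
    kind: ConfigMap
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["nodes"]
  validations:
    # deny any Node whose spec.providerID is already tracked in the blacklist ConfigMap
    # (this sketch assumes providerIDs are stored as keys of .data)
    - expression: "!has(object.spec.providerID) || !(object.spec.providerID in params.data)"
      messageExpression: '"node can''t reuse a blacklisted providerID: " + object.spec.providerID'
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: nodes-provider-id-blacklist
spec:
  policyName: nodes-provider-id-blacklist
  validationActions: ["Deny"]
  paramRef:
    name: nodes-provider-ids
    namespace: kyverno              # assumption: wherever the tracking ConfigMap lives
    parameterNotFoundAction: Deny   # assumption; Allow may be preferable if the ConfigMap can be absent
```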
The other thing done by this MR is to update the first policy by adding CEL matchConditions to ensure that it will not trigger any webhook call for the API requests made during the creation of a Node. This is done by using spec.unschedulable: true as the trigger, which is sufficient to ensure that the provider-id blacklist is populated during node drain, and hence before Node deletion. This in turn allows setting failurePolicy: Fail on the Kyverno policy, so that we're 100% sure it runs.
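
Purely as an illustration, here is a sketch of the shape this could take on the tracking policy, assuming a Kyverno version that exposes matchConditions and failurePolicy under spec.webhookConfiguration; the policy/rule names, the namespace and the mutate rule body are hypothetical, and only the matchConditions / failurePolicy: Fail idea reflects what this MR describes.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: cluster-nodes-provider-id-tracker     # hypothetical name
spec:
  webhookConfiguration:
    failurePolicy: Fail       # safe now: the webhook is only called for drain-time updates
    matchConditions:
      # only invoke the Kyverno webhook when the Node is marked unschedulable (cordon/drain),
      # never for the API calls made while a Node registers itself
      - name: node-is-cordoned
        expression: "has(object.spec.unschedulable) && object.spec.unschedulable == true"
  rules:
    - name: record-provider-id
      match:
        any:
          - resources:
              kinds: ["Node"]
              operations: ["UPDATE"]
      # hypothetical "mutate existing" rule recording the providerID in the tracking ConfigMap
      mutate:
        targets:
          - apiVersion: v1
            kind: ConfigMap
            name: nodes-provider-ids
            namespace: kyverno                 # assumption
        patchesJson6902: |-
          - op: add
            path: "/data/{{ request.object.spec.providerID }}"
            value: "{{ request.object.metadata.name }}"
```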
Last, we remove the cleanup policy, which has become useless now that we know that the Kyverno policy is always triggered.
How this was tested
Steps taken:
- start from an existing RKE2 deployment
- `kubectl cordon node-x`
- observe that the nodes-provider-ids ConfigMap is populated with the providerID of node-x
- `kubectl delete node node-x`
- `pkill kubelet` on the server to forcefully restart kubelet, to have it attempt to register the node again
Taking these steps with the code from this MR, we observe that the Node is not recreated, with the following in the kubelet logs:
E0410 12:56:35.850894 25288 kubelet_node_status.go:95] "Unable to register node with API server" err="nodes \"management-cluster-md0-cv86x-ntmz2\" is forbidden: ValidatingAdmissionPolicy 'nodes-provider-id-blacklist' with binding 'nodes-provider-id-blacklist' denied request: node can't reuse a blacklisted providerID: 9a8397a0-f2b1-4bda-b769-4754c22cfba4" node="management-cluster-md0-cv86x-ntmz2"
(Taking these steps but with the VAP deleted, or after manually removing the providerID from the nodes-provider-ids ConfigMap, leads to the Node being recreated at once after kubelet is restarted.)
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open the CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2 |
| 🐧 | Node OS | ubuntu, suse |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0 |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.3.x 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 sylva-upgrade-from-1.3.x 🛠️ ha 🐧 ubuntu
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling `autorun` will make deployment pipelines run automatically without human interaction
- Disabling `allow failure` will make deployment pipelines mandatory for pipeline success
- If both `autorun` and `allow failure` are disabled, deployment pipelines will need manual triggering but will be blocking the pipeline
Be aware: after a configuration change, the pipeline is not triggered automatically.
Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.