Enhance the reliability of rke2-metrics-server deployment

What does this MR do and why?

Closes #1857 (closed)

Issue Summary

The rke2-metrics-server chart was fragile due to:

  • No PodDisruptionBudget: eviction during node drains could cause full downtime.
  • Single replica: no high availability.
  • Rolling updates were not smooth: a service update caused downtime until the new pod became ready.
  • The Kubernetes API server relies on the APIService backed by metrics-server, so any downtime risks breaking cluster functions.
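
For context on the last point, this is roughly what the aggregated API registration looks like. It is a sketch based on the upstream metrics-server manifests; the service name and namespace are assumptions and may differ in the rke2-metrics-server chart.

```yaml
# Sketch of the APIService that routes metrics.k8s.io requests to metrics-server.
# The backing Service name and namespace below are assumed, not taken from this MR.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: rke2-metrics-server   # assumed Service name
    namespace: kube-system      # assumed namespace
```

When no pod behind that Service is ready, the APIService is reported as unavailable and calls to metrics.k8s.io (for example `kubectl top`) fail.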

Solution

Installation of rke2-metrics-server by the CABPR bootstrap provider is disabled in sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster!654 (merged). Instead, it is handled by two new units, rke2-metrics-server and rke2-metrics-server-ha.

These units will handle the following:

  • High Availability Enabled: replicas: 2 ensures the metrics-server is redundant, avoiding single points of failure.
  • Targeted Node Scheduling: pods are scheduled on nodes labeled node-role.kubernetes.io/control-plane: "true".
  • Safe Rolling Updates: updateStrategy with maxSurge: 1, maxUnavailable: 1 keeps at least one of the two replicas available at all times during upgrades, avoiding downtime.
  • Pod Disruption Budget (PDB) Configured: prevents voluntary disruptions (e.g., node drain) from evicting all pods at once, so at least one instance always remains available.
  • Anti-Affinity Rules: replicas are spread across different nodes (based on kubernetes.io/hostname), improving fault tolerance and reducing the risk of correlated failures.
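
As an illustration, the settings above could translate into Kubernetes objects along these lines. This is a minimal sketch, not the rendered chart output: the object names, namespace, and the `app` label are hypothetical, the Helm value updateStrategy is shown here as the Deployment's `strategy` field, and a hard (required) anti-affinity rule and `minAvailable: 1` are assumptions about how the anti-affinity and PDB are configured.

```yaml
# Sketch only: relevant fields of the Deployment, not a complete manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rke2-metrics-server          # hypothetical name
  namespace: kube-system             # assumed namespace
spec:
  replicas: 2                        # high availability
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1              # with 2 replicas, at least 1 stays available
  selector:
    matchLabels:
      app: rke2-metrics-server       # hypothetical label
  template:
    metadata:
      labels:
        app: rke2-metrics-server
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
      affinity:
        podAntiAffinity:
          # Assumption: a required rule; the chart may use a preferred rule instead.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: rke2-metrics-server
              topologyKey: kubernetes.io/hostname
      # containers, tolerations, etc. omitted
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rke2-metrics-server          # hypothetical name
  namespace: kube-system
spec:
  minAvailable: 1                    # assumed; a node drain cannot evict the last pod
  selector:
    matchLabels:
      app: rke2-metrics-server
```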

Related reference(s)

Test coverage

CI configuration

Below you can choose test deployment variants to run in this MR's CI.


Legend:

Icon | Meaning            | Available values
☁️   | Infra Provider     | capd, capo, capm3
🚀   | Bootstrap Provider | kubeadm (alias kadm), rke2
🐧   | Node OS            | ubuntu, suse
🛠️   | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0
🎬   | Pipeline Scenarios | Available scenario list and description

  • ☁️ capo 🚀 rke2 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.3.x 🛠️ ha,misc 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.3.x 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 rke2 🐧 ubuntu

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun makes deployment pipelines run automatically, without human interaction.
  • Disabling allow failure makes deployment pipelines mandatory for pipeline success.
  • If both autorun and allow failure are disabled, deployment pipelines need manual triggering but still block the pipeline.

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the run pipeline button in the Pipelines tab) or push new code.

Edited by Thomas Morin
