Add NVIDIA GPU Operator

What does this MR do and why?

This MR adds a new sylva unit encapsulating the NVIDIA GPU Operator, which bundles all the components needed to provision NVIDIA GPUs: driver, container runtime, device plugin and monitoring support. It allows:

  • Allocation of an entire NVIDIA GPU to a pod in bare-metal workload clusters (see the example pod below).
  • Collection of GPU metrics (notably utilization, power and temperature) by Prometheus, if monitoring is enabled in the cluster.
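
As an illustration of whole-GPU allocation, a workload pod can request one GPU through the `nvidia.com/gpu` resource advertised by the device plugin deployed by the operator. The sketch below is not part of this MR; the pod name and image tag are illustrative only:

```yaml
# Minimal sketch (not part of this MR): a pod requesting one entire NVIDIA GPU.
# The nvidia.com/gpu resource is advertised by the device plugin deployed by the
# GPU Operator; the image tag below is only an example and may differ.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]     # prints the GPU(s) allocated to the pod
      resources:
        limits:
          nvidia.com/gpu: 1       # request one full GPU
```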

Details

  • By default, the operator installs its own Node Feature Discovery (NFD) component, which may be redundant with the one provided by the SR-IOV unit. A check is therefore performed so that it is installed only when necessary.
  • Since the container runtime socket path is required, depends on the provider and is used in two units, it has been factored out so that it is defined in a single place (`.Values.internal.container_runtime_settings`); see the values sketch below.
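
The sketch below illustrates the intent of that factorization; apart from `.Values.internal.container_runtime_settings`, which is quoted from this MR, the key names are assumptions and may differ from the actual unit values:

```yaml
# Illustrative sketch only: a single definition of the container runtime socket,
# shared by the units that need it. Key names other than
# internal.container_runtime_settings are assumptions.
internal:
  container_runtime_settings:
    # The containerd socket path depends on the bootstrap provider, e.g.:
    #   kubeadm: /run/containerd/containerd.sock
    #   RKE2:    /run/k3s/containerd/containerd.sock
    socket_path: /run/k3s/containerd/containerd.sock

# The GPU Operator chart can then be pointed at this single value, and its bundled
# Node Feature Discovery enabled only when no other unit (e.g. SR-IOV) already
# provides one:
nfd:
  enabled: false   # keep the NFD instance already deployed by another unit
```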

Closes #1029

Test coverage

  • The operator's own tests, run once it is deployed

  • Deployment on a capm3 platform with a GPU, on RKE2 with ubuntu and RKE2 with opensuse:

    • Simple CUDA operation (see the example job below)
    • vLLM installation with a 7b model: not tested (privileged mode only)
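
As an illustration of what a "simple CUDA operation" check can look like, NVIDIA's vectorAdd CUDA sample can be run as a job on a node with an allocated GPU. This is a sketch, not the exact test used in this MR, and the image tag is an example that may differ:

```yaml
# Sketch of a "simple CUDA operation" check (not the exact test of this MR):
# run NVIDIA's vectorAdd CUDA sample on a node with an allocated GPU.
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-vectoradd            # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: vectoradd
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # example tag, may differ
          resources:
            limits:
              nvidia.com/gpu: 1   # the sample needs one GPU
```

If the job completes successfully (the sample should report a passing test in its logs), the driver, container runtime configuration and device plugin deployed by the operator are working end to end.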

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

  • ☁️ Infra Provider: capd, capo, capm3
  • 🚀 Bootstrap Provider: kubeadm (alias kadm), rke2
  • 🐧 Node OS: ubuntu, suse
  • 🛠️ Deployment Options: light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging
  • 🎬 Pipeline Scenarios: available scenario list and description
  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu 🟢 wkld:nvidia-gpu-operator
  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse 🟢 wkld:nvidia-gpu-operator
  • ☁️ capo 🚀 rke2 🐧 suse
  • ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 wkld:nvidia-gpu-operator
  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 ubuntu
  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 ubuntu
  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
  • ☁️ capm3 🚀 rke2 🐧 suse 🟢 wkld:nvidia-gpu-operator
  • ☁️ capm3 🚀 kadm 🐧 ubuntu 🟢 wkld:nvidia-gpu-operator
  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 suse
  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 suse
  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
  • ☁️ capm3 🚀 ck8s 🎬 no-wkld 🛠️ light-deploy 🐧 ubuntu

Global config for deployment pipelines

  • autorun pipelines

  • allow failure on pipelines

  • record sylvactl events

Notes:

  • Enabling autorun will make deployment pipelines run automatically, without human interaction.
  • Disabling allow failure will make deployment pipelines mandatory for pipeline success.
  • If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline.

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.
