
Draft: Enable the use of CDI on OpenShift

Christopher Desiniotis requested to merge cdi-on-ocp into master

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>

This MR configures the toolkit container and device-plugin appropriately so that CDI can be used to provide GPU access to both management and application containers on OpenShift.

On OpenShift, we cannot set 'nvidia' as the default runtime, so this MR takes a hybrid approach to enabling CDI: the 'nvidia' runtime, configured in CDI mode, provides GPU access to management containers, while native CDI support in CRI-O provides GPU access to application containers.

// gpu-operator is installed with cdi.enabled=true and cdi.default=true.
// Configuring CDI as default means that the runtime named 'nvidia' is
// configured in CDI mode -- which means our operands get GPU access via CDI
// (setting cdi.default=true is not strictly required)
[core@ocp414-chris ~]$ oc get pods
NAME                                                              READY   STATUS      RESTARTS        AGE
033e27a79c37d2326e950294c9249b0db5077d71489b63574b027872fdh9plh   0/1     Completed   0               123m
gpu-feature-discovery-xnpcr                                       1/1     Running     0               6m28s
gpu-operator-65957ccc56-sxv4h                                     1/1     Running     0               88m
nvidia-container-toolkit-daemonset-k8znw                          1/1     Running     0               6m28s
nvidia-cuda-validator-gdk6h                                       0/1     Completed   0               3m48s
nvidia-dcgm-dr5pn                                                 1/1     Running     0               6m28s
nvidia-dcgm-exporter-42k8n                                        1/1     Running     2 (3m31s ago)   6m28s
nvidia-device-plugin-daemonset-46cf4                              1/1     Running     0               6m28s
nvidia-driver-daemonset-414.92.202312011602-0-6s7tw               2/2     Running     0               7m28s
nvidia-node-status-exporter-l2frs                                 1/1     Running     0               87m
nvidia-operator-validator-zthg4                                   1/1     Running     0               6m28s
quay-io-cdesiniotis1-gpu-operator-bundle-devel                    1/1     Running     0               123m

// no hook was installed
[core@ocp414-chris ~]$ ls -ltr /usr/share/containers/oci/hooks.d/
total 0

[core@ocp414-chris ~]$ sudo ls -ltr /run/containers/oci/hooks.d
total 0

// nvidia is not set as the default runtime
[core@ocp414-chris ~]$ cat /etc/crio/crio.conf.d/99-nvidia.conf

[crio]

  [crio.runtime]

    [crio.runtime.runtimes]

      [crio.runtime.runtimes.nvidia]
        runtime_path = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        runtime_type = "oci"

      [crio.runtime.runtimes.nvidia-cdi]
        runtime_path = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
        runtime_type = "oci"

      [crio.runtime.runtimes.nvidia-legacy]
        runtime_path = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
        runtime_type = "oci"


[core@ocp414-chris ~]$ ls -ltr /var/run/cdi/
total 24
-rw-------. 1 root root 9074 Feb 14 00:53 management.nvidia.com-gpu.yaml
-rw-------. 1 root root 9181 Feb 14 00:55 k8s.device-plugin.nvidia.com-gpu.json
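
The manual checks in the transcript above (no OCI hook installed, 'nvidia' not set as the default runtime, the three runtime handlers present) could be scripted roughly as follows. This is a minimal sketch, not part of this MR: the function names are hypothetical helpers, and it assumes the CRI-O drop-in uses the plain `default_runtime = "..."` and `[crio.runtime.runtimes.<name>]` forms shown above.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical and are not provided by
# the GPU Operator, the NVIDIA Container Toolkit, or CRI-O.

# Succeeds if the hooks directory is missing or has no entries.
# Returns 2 if the directory exists but cannot be read (run as root then).
hooks_dir_is_empty() {
    local dir="$1"
    [ -d "$dir" ] || return 0
    [ -r "$dir" ] || return 2
    [ -z "$(ls -A "$dir")" ]
}

# Succeeds only if the given CRI-O config file sets default_runtime = "nvidia".
nvidia_is_default_runtime() {
    local conf="$1"
    grep -Eq '^[[:space:]]*default_runtime[[:space:]]*=[[:space:]]*"nvidia"' "$conf"
}

# Succeeds if the given CRI-O config file defines a runtime handler table
# with exactly the given name, e.g. "nvidia" or "nvidia-cdi".
has_runtime_handler() {
    local conf="$1" name="$2"
    grep -Fq "[crio.runtime.runtimes.${name}]" "$conf"
}
```

On the node shown above, `nvidia_is_default_runtime /etc/crio/crio.conf.d/99-nvidia.conf` would be expected to exit non-zero, while `has_runtime_handler /etc/crio/crio.conf.d/99-nvidia.conf nvidia-cdi` would be expected to succeed. Since `/run/containers/oci/hooks.d` needed `sudo` in the transcript, `hooks_dir_is_empty` should be run as root for that path.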
Edited by Christopher Desiniotis
