
Add a test for when the driver and other NVIDIA components are already deployed in the cluster

Add a test for what happens when the driver and device plugin are already deployed in the cluster. What is the expected behavior?
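A minimal sketch of what such a test could look like, assuming a bash-based harness and a node that already has the driver and device plugin installed before the operator is deployed. The namespace and pod names match the listings below; the timings and the pass/fail criterion are assumptions and depend on the answer to the question above.

```bash
#!/usr/bin/env bash
# Sketch only: exercise the operator on a node where NVIDIA components already exist.
set -eu

NS=gpu-operator-resources

# Precondition (assumed done out of band): the NVIDIA driver and device plugin
# are already installed on the node, e.g. via distro packages.

# Deploy the operator the same way as in the steps below.
helm install https://nvidia.github.io/gpu-operator/gpu-operator-1.0.0-techpreview.1.tgz

# Give the driver daemonset time to come up, then look for the crash loop
# observed in the logs below. Whether this is a test failure or the expected,
# documented behavior is exactly the open question of this issue.
sleep 300
if kubectl get pods -n "$NS" | grep nvidia-driver-daemonset | grep -q CrashLoopBackOff; then
    echo "driver daemonset crash-loops when a driver is already installed on the host"
    kubectl logs -n "$NS" "$(kubectl get pods -n "$NS" -o name | grep nvidia-driver-daemonset)"
    exit 1
fi
```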

Adding the logs from running this test on a T4 server

Deploy gpu-operator

$ helm install https://nvidia.github.io/gpu-operator/gpu-operator-1.0.0-techpreview.1.tgz

Check operator resources

$ kubectl get pods --all-namespaces | grep operator

NAMESPACE                NAME                                         READY   STATUS             RESTARTS   AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset-djwlg     1/1     Running            0          71m
gpu-operator-resources   nvidia-device-plugin-daemonset-hpvf6         1/1     Running            0          8m47s
gpu-operator-resources   nvidia-driver-daemonset-wtr5b                0/1     CrashLoopBackOff   18         71m
gpu-operator-resources   nvidia-driver-validation                     0/1     Completed          0          59m
gpu-operator             special-resource-operator-78c7499d65-chj4h   1/1     Running            0          3h7m

Deploying the driver container should fail -> there is no explicit error message, so the user has to go and check the pods themselves (a scripted check is sketched after the pod listing below)

$ kubectl get pods -n gpu-operator-resources

NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-djwlg   1/1     Running            0          72m
nvidia-device-plugin-daemonset-hpvf6       1/1     Running            0          9m35s
nvidia-driver-daemonset-wtr5b              0/1     CrashLoopBackOff   18         71m
nvidia-driver-validation                   0/1     Completed          0          60m
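As a hedged alternative to eyeballing the pod list, the same condition could be surfaced by reading the daemonset status directly. The daemonset name is inferred from the pod names above; the jsonpath fields are standard DaemonSet status fields.

```bash
# Sketch: report driver daemonset health without manually scanning pod output.
NS=gpu-operator-resources
DESIRED=$(kubectl get daemonset nvidia-driver-daemonset -n "$NS" -o jsonpath='{.status.desiredNumberScheduled}')
READY=$(kubectl get daemonset nvidia-driver-daemonset -n "$NS" -o jsonpath='{.status.numberReady}')
if [ "${READY:-0}" != "${DESIRED:-0}" ]; then
    echo "nvidia-driver-daemonset not healthy: ${READY:-0}/${DESIRED:-0} pods ready"
    exit 1
fi
```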

$ kubectl logs nvidia-driver-daemonset-wtr5b -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 418.40.04 for Linux kernel version 4.15.0-55-generic

Stopping NVIDIA persistence daemon...

Unloading NVIDIA driver kernel modules...

Could not unload NVIDIA driver kernel modules, driver is in use

Stopping NVIDIA persistence daemon...

Unloading NVIDIA driver kernel modules...

Could not unload NVIDIA driver kernel modules, driver is in use
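The "driver is in use" messages above point at the pre-installed host driver still being loaded. A hedged way to confirm that from the node itself, assuming shell access to the GPU node (nothing here is part of the operator):

```bash
# Sketch: check whether NVIDIA kernel modules are already loaded on the host.
if lsmod | grep -q '^nvidia'; then
    echo "NVIDIA kernel modules already loaded on this node:"
    lsmod | grep '^nvidia'
    # Report the host driver version if nvidia-smi is available.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || true
fi
```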