Add a test for when the driver and other NVIDIA components are already deployed in the cluster
Add a test for what happens when the driver and device plugin are already deployed in the cluster. What is the expected behavior?
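A minimal sketch of what such a test could look like, assuming a bash-based e2e check run against a cluster whose GPU node already has the NVIDIA driver (and device plugin) installed outside the operator. The chart URL and the gpu-operator-resources namespace are taken from the transcript below; the app=nvidia-driver-daemonset label selector and the polling timeout are assumptions.

#!/usr/bin/env bash
# Sketch only: pre-condition is a pre-existing driver/device-plugin installation on the node.
set -euo pipefail

NS=gpu-operator-resources

# Install the operator the same way as in the transcript below.
helm install https://nvidia.github.io/gpu-operator/gpu-operator-1.0.0-techpreview.1.tgz

# Today the driver daemonset ends up in CrashLoopBackOff because it cannot unload
# the in-use kernel modules; flag that until the operator either reports an
# explicit error or skips the driver container.
for _ in $(seq 1 30); do
  reason=$(kubectl get pods -n "$NS" -l app=nvidia-driver-daemonset \
    -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' 2>/dev/null || true)
  if [ "$reason" = "CrashLoopBackOff" ]; then
    echo "FAIL: driver daemonset is crash-looping and no error is surfaced" >&2
    exit 1
  fi
  sleep 10
done
echo "PASS: driver daemonset did not crash-loop"

Whether the expected behavior is "fail loudly" or "skip the driver container" is exactly the question above; the sketch only encodes the detection.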
Below are the logs from a manual run of this test on a T4 server.
Deploy gpu-operator
$ helm install https://nvidia.github.io/gpu-operator/gpu-operator-1.0.0-techpreview.1.tgz
Check operator resources
$ kubectl get pods --all-namespaces | grep operator
NAMESPACE                NAME                                         READY   STATUS             RESTARTS   AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset-djwlg    1/1     Running            0          71m
gpu-operator-resources   nvidia-device-plugin-daemonset-hpvf6        1/1     Running            0          8m47s
gpu-operator-resources   nvidia-driver-daemonset-wtr5b               0/1     CrashLoopBackOff   18         71m
gpu-operator-resources   nvidia-driver-validation                    0/1     Completed          0          59m
gpu-operator             special-resource-operator-78c7499d65-chj4h   1/1     Running            0          3h7m
Deploying the driver container fails as expected -> but there is no explicit error message; the user has to go and check the pods themselves (see the sketch after the listing below)
$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-djwlg   1/1     Running            0          72m
nvidia-device-plugin-daemonset-hpvf6       1/1     Running            0          9m35s
nvidia-driver-daemonset-wtr5b              0/1     CrashLoopBackOff   18         71m
nvidia-driver-validation                   0/1     Completed          0          60m
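For reference, the waiting reason can also be pulled out directly instead of scanning the pod list; the namespace comes from the output above, and the app=nvidia-driver-daemonset label selector is an assumption about how the daemonset pods are labelled:

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset \
    -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}'

In the state shown above this would print CrashLoopBackOff, which is the signal the operator could surface instead of leaving it to the user to find.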
$ kubectl logs nvidia-driver-daemonset-wtr5b -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 418.40.04 for Linux kernel version 4.15.0-55-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
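The last lines show the root cause: a pre-existing driver is loaded and in use, so the container cannot unload it. Below is a hedged sketch of a pre-check the driver container (or the test above) could run on the node before attempting the unload; /sys/module/nvidia/refcnt is a standard kernel interface, but where such a check would live and what it should do on detection are assumptions:

#!/usr/bin/env bash
# Sketch: detect a pre-existing, in-use NVIDIA driver before trying to unload it.
if [ -e /sys/module/nvidia/refcnt ] && [ "$(cat /sys/module/nvidia/refcnt)" -gt 0 ]; then
  echo "NVIDIA driver is already loaded and in use (refcnt=$(cat /sys/module/nvidia/refcnt))" >&2
  echo "skipping in-container driver installation" >&2
  exit 0
fi

Whether detection should mean skipping the install (as sketched) or failing with an explicit error is the open question in this issue.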