Driver container is not recoverable.
Background - OS distribution and NVIDIA driver versions
OS distribution: CentOS 7
[centos@ip-10-0-129-52 ~]$ uname -r
3.10.0-1062.4.1.el7.x86_64
[centos@ip-10-0-129-52 ~]$ KERNEL_VERSION=$(uname -r) && yum -q list available --show-duplicates kernel-headers | awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' | tac | grep -E -m1 "^${KERNEL_VERSION/latest/.*}"
3.10.0-1062.4.1.el7.x86_64
Driver container version: nvidia/driver:418.87.01-centos7
Device plugin version: nvidia/k8s-device-plugin/1.0.0-beta4
Problem Description
The nvidia/driver container is crash-looping on two nodes when we try to scale up the GPU machines in the cluster.
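The output below was pulled from the previous (crashed) container instance, roughly like this (the namespace and pod name are placeholders for our setup, not exact values):

kubectl get pods --all-namespaces -o wide | grep nvidia-driver
kubectl -n <driver-namespace> logs <nvidia-driver-pod> --previous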
logs:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 418.40.04 for Linux kernel version 3.10.0-1062.4.1.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
This is not a kernel vs. kernel-headers mismatch issue, because we ran the pre-check shown above and confirmed that a kernel-headers package matching the running kernel version is available on the host.
One hypothesis is that the /etc/resolv.conf generated by the kubelet under /var/lib/kubelet/... is not mounted into the rootfs by the nvidia-container-runtime, so the yum list query fails because the container cannot resolve DNS, and "Could not resolve Linux kernel version" is returned. I cannot prove this because I was not able to exec into the pod while it is crash-looping.
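One way to test this without exec'ing into the pod might be to inspect the exited driver container directly on the node. This is only a sketch and assumes Docker is the container runtime on these nodes; the container ID is a placeholder:

# find the exited nvidia-driver container on the affected node
docker ps -a | grep nvidia-driver
# check which resolv.conf Docker recorded for it, and what was mounted into it
docker inspect <container-id> --format '{{.ResolvConfPath}}'
docker inspect <container-id> --format '{{json .Mounts}}'
# docker cp works on stopped containers, so the resolv.conf the container saw can be copied out
docker cp <container-id>:/etc/resolv.conf ./resolv.conf.from-driver-container

If the copied file is empty or missing the cluster DNS servers, that would support the DNS hypothesis.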
This issue is hard to reproduce. I haven't found a way to repro it consistently.
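If it does happen again, one option to get a shell could be to temporarily override the driver container's command so it idles instead of crash-looping, then exec in and check DNS by hand. This is a sketch; the DaemonSet name, namespace, and container index are assumptions about the deployment, not values from this cluster:

kubectl -n <driver-namespace> patch daemonset <nvidia-driver-daemonset> --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "infinity"]}]'
# once the pod is Running:
kubectl -n <driver-namespace> exec -it <nvidia-driver-pod> -- cat /etc/resolv.conf
kubectl -n <driver-namespace> exec -it <nvidia-driver-pod> -- yum -q list available kernel-headers
# revert afterwards
kubectl -n <driver-namespace> rollout undo daemonset/<nvidia-driver-daemonset>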