Add nvidia-container-runtime-hook path to config file
I understand CoreOS is in beta. This issue is a user report + feature request.
First: I'm having success with nvidia/driver:418.40.04-4.19.50-coreos-r1-coreos
on CoreOS stable 2135.5.0. This is wonderful! Thanks!
I've got a /etc/systemd/system/nvidia-driver.service launching the driver container:
[Unit]
Description=NVIDIA driver
After=docker.service
Requires=docker.service
[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill nvidia-driver
ExecStartPre=-/usr/bin/docker rm nvidia-driver
ExecStartPre=-/usr/sbin/modprobe ipmi_devintf
Environment=NVIDIA_VERSION=418.40.04
ExecStart=/bin/sh -c 'exec /usr/bin/docker run --privileged --pid=host --name=nvidia-driver -v /run/nvidia:/run/nvidia:shared nvidia/driver:$NVIDIA_VERSION-$(uname -r)-coreos --accept-license'
Easy! Kernel modules get loaded, and docker exec nvidia-driver nvidia-smi shows a GPU.
I struggled to get the nvidia container runtime working. nvidia-container-runtime would always fail, writing these debug messages to /var/log/nvidia-container-runtime:
Running /run/nvidia/driver/usr/bin/nvidia-container-runtime
Using bundle file: /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/…/config.json
ERROR: inject NVIDIA hook: stat /usr/bin/nvidia-container-runtime-hook: no such file or directory
nvidia-container-runtime produces that error when it can't find nvidia-container-runtime-hook in PATH or in /usr/bin. PATH is the intended solution, since of course /usr/bin is read-only on CoreOS.
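For anyone debugging the same error, here is a rough shell sketch of the lookup order described above: PATH first, then /usr/bin. It only mimics the behavior I observed and is not the runtime's actual code. The function name find_hook is made up for illustration.

```shell
#!/bin/sh
# find_hook [NAME]: print the path that a "PATH first, then /usr/bin" lookup
# would pick for NAME, or return 1 if neither location has it.
# Hypothetical helper; it imitates the observed behavior, nothing more.
find_hook() {
    hook_name="${1:-nvidia-container-runtime-hook}"
    # First choice: whatever the current PATH resolves to.
    if found=$(command -v "$hook_name" 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi
    # Fallback: the fixed /usr/bin location named in the error message.
    if [ -x "/usr/bin/$hook_name" ]; then
        printf '%s\n' "/usr/bin/$hook_name"
        return 0
    fi
    return 1
}
```

Running find_hook with the same PATH the runtime is started with should show which hook (if any) it would pick up.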
I could not get PATH to make it through dockerd into nvidia-container-runtime. On CoreOS, dockerd is normally launched from /run/systemd/system/docker.service via:
ExecStart=/usr/bin/env PATH=${TORCX_BINDIR}:${PATH} ${TORCX_BINDIR}/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_SELINUX $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_BIP $DOCKER_OPT_MTU $DOCKER_OPT_IPMASQ
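One way to check whether a PATH change actually reached a running process is to read its environment from /proc. This is a minimal sketch, assuming a Linux /proc filesystem and enough privileges to read the target process (root, for dockerd). The helper name proc_env_var is hypothetical. Note that /proc/PID/environ shows the environment as it was when the process was exec'd.

```shell
#!/bin/sh
# proc_env_var PID NAME: print NAME=value as recorded in /proc/PID/environ.
# Hypothetical diagnostic helper; needs read access to the target process.
proc_env_var() {
    pid="$1"
    name="$2"
    # environ is NUL-separated; convert it to one variable per line first.
    tr '\0' '\n' < "/proc/$pid/environ" | grep "^$name="
}
```

Calling proc_env_var with the PID of dockerd (or of whichever process ends up launching the runtime) and PATH as the name prints the PATH that process really started with.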
I tried overriding ExecStart, I tried tinkering with Environment, nothing. Could be user error. Ultimately I made a shell script /etc/nvidia-container-runtime.sh:
#!/bin/sh -e
export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin:/run/nvidia/driver/usr/bin
exec /run/nvidia/driver/usr/bin/nvidia-container-runtime "$@"
I registered the shell script on start by setting DOCKER_OPTS:
[Service]
Environment="DOCKER_OPTS=--add-runtime=nvidia=/etc/nvidia-container-runtime.sh"
…and I configured the runtime in /etc/nvidia-container-runtime/config.toml:
[nvidia-container-cli]
root = "/run/nvidia/driver"
With this, docker run --runtime=nvidia --rm nvidia/cuda:9.2-base nvidia-smi works.
My feature request: nvidia-container-runtime searches for nvidia-container-runtime-hook inside PATH. nvidia-container-runtime already reads that config file for other purposes. I'd like a knob in that config file so I can eliminate my shell script, and preferably eliminate the dependence on PATH altogether.
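To make the request concrete, here is a shell sketch of the resolution order I have in mind: use the config knob if it is set, otherwise fall back to PATH. The key name hook is only my proposal and does not exist in the config file today. The sed line is a naive stand-in for real TOML parsing, and resolve_hook is a made-up name.

```shell
#!/bin/sh
# resolve_hook CONFIG_FILE: print the hook path, preferring a `hook = "..."`
# line in CONFIG_FILE over a PATH lookup.
# Assumption: `hook` is the proposed (not yet existing) config key.
resolve_hook() {
    config="$1"
    hook=""
    if [ -r "$config" ]; then
        # Naive extraction of the first `hook = "value"` line; not a TOML parser.
        hook=$(sed -n 's/^[[:space:]]*hook[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$config" | head -n 1)
    fi
    if [ -n "$hook" ]; then
        printf '%s\n' "$hook"
        return 0
    fi
    # Knob not set: keep the PATH lookup as the fallback.
    command -v nvidia-container-runtime-hook
}
```

With the knob set, the result no longer depends on whatever PATH dockerd happens to pass along, which is the point of the request.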
I think that would reduce NVIDIA-on-CoreOS to these three files:
# /etc/nvidia-container-runtime/config.toml
[nvidia-container-runtime]
path = "/run/nvidia/driver/usr/bin"
[nvidia-container-cli]
hook = "/run/nvidia/driver/usr/bin/nvidia-container-runtime-hook"
# /etc/systemd/system/docker.service.d/override.conf
[Service]
Environment="DOCKER_OPTS=--add-runtime=nvidia=/run/nvidia/driver/usr/bin/nvidia-container-runtime"
# /etc/systemd/system/nvidia-driver.service (above)