`runtime` image is missing `libdevice` required for model training in TensorFlow
I'm training a model using TensorFlow in an image based on `nvcr.io/nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04`. Training (i.e. `model.fit()`) fails during initialisation with the following error:
2023-05-24 17:19:20.800561: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-05-24 17:19:20.804495: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
I can see that the image includes `libnvidia-nvvm`, but not `libdevice`:
/usr/local/cuda-11.8/compat# ll
total 145724
drwxr-xr-x 2 root root 4096 Feb 2 05:19 ./
drwxr-xr-x 1 root root 4096 Feb 2 05:25 ../
lrwxrwxrwx 1 root root 12 Sep 29 2022 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Sep 29 2022 libcuda.so.1 -> libcuda.so.520.61.05
-rw-r--r-- 1 root root 26284256 Sep 29 2022 libcuda.so.520.61.05
lrwxrwxrwx 1 root root 28 Sep 29 2022 libcudadebugger.so.1 -> libcudadebugger.so.520.61.05
-rw-r--r-- 1 root root 10934360 Sep 29 2022 libcudadebugger.so.520.61.05
lrwxrwxrwx 1 root root 19 Sep 29 2022 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root 27 Sep 29 2022 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.520.61.05
-rw-r--r-- 1 root root 92017376 Sep 29 2022 libnvidia-nvvm.so.520.61.05
lrwxrwxrwx 1 root root 37 Sep 29 2022 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.520.61.05
-rw-r--r-- 1 root root 19963864 Sep 29 2022 libnvidia-ptxjitcompiler.so.520.61.05
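The listing above only covers the `compat` directory, so it may be worth confirming that `libdevice` is absent from the whole CUDA tree. A minimal sketch, assuming CUDA lives under `/usr/local/cuda-11.8` (with `/usr/local/cuda` as a possible symlink to it); the helper name `find_libdevice` is hypothetical:

```python
from pathlib import Path


def find_libdevice(roots):
    """Return the sorted, de-duplicated paths of libdevice*.bc files under roots."""
    found = set()
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # skip roots that do not exist in this image
        for candidate in root_path.rglob("libdevice*.bc"):
            # resolve() collapses symlinked roots onto the same real path
            found.add(str(candidate.resolve()))
    return sorted(found)


if __name__ == "__main__":
    # Assumed install locations; adjust to the image being inspected.
    hits = find_libdevice(["/usr/local/cuda-11.8", "/usr/local/cuda"])
    print("\n".join(hits) if hits else "libdevice not found")
```

Running this inside the `runtime` container should print `libdevice not found` if the file really is missing.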
I can work around it by manually installing the `nvcc` package inside the `runtime` image:
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-nvcc-11-8_11.8.89-1_amd64.deb && \
dpkg --ignore-depends=cuda-cudart-dev-11-8 -i cuda-nvcc-11-8_11.8.89-1_amd64.deb
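In my case TensorFlow picked up `libdevice` automatically once the package was installed. If it does not in another setup, XLA can be pointed at the CUDA directory through the `--xla_gpu_cuda_data_dir` option in the `XLA_FLAGS` environment variable. The sketch below is an assumption on my side rather than something I needed: the helper name `point_xla_at_cuda` is hypothetical, and the path assumes `libdevice` ends up under `/usr/local/cuda-11.8/nvvm/libdevice`.

```python
import os


def point_xla_at_cuda(cuda_dir, env=os.environ):
    """Add --xla_gpu_cuda_data_dir=<cuda_dir> to XLA_FLAGS unless already present."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    existing = env.get("XLA_FLAGS", "")
    if flag not in existing.split():
        # Keep any flags that were already set and append ours
        env["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return env["XLA_FLAGS"]


if __name__ == "__main__":
    # Assumed CUDA location; set this before importing TensorFlow to be safe.
    print(point_xla_at_cuda("/usr/local/cuda-11.8"))
```

The same effect can be had by exporting `XLA_FLAGS` in the Dockerfile or the shell; the Python version is only convenient when the training script is the single entry point.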
Training also works if I use the `devel` version. However, I want to stick with `runtime`, as it's much smaller and I don't need any of the compilation capabilities or dev tools.
Would it make sense to add the missing package to the `runtime` image?
Specs
Python and TensorFlow are installed inside the Docker image. The following versions were used:
- TensorFlow version: 2.12.0 (installed via `pip`)
- Python version: 3.11.3
- GPU: NVIDIA A10G (24 GB)