cuda issues
https://gitlab.com/nvidia/container-images/cuda/-/issues

Issue #195: nvidia/cuda 11.4 cudnn container missing arm64 platforms
https://gitlab.com/nvidia/container-images/cuda/-/issues/195 (ANDREW Rampulla, 2023-04-27; assignee: Jesus Alvarez)

None of the Ubuntu 20.04 CUDA 11.4.(0-3) Docker containers are built for the arm64 platform. This makes it impossible to use cuDNN on an Orin AGX.

Issue #129: APT Mirror issue with cuda.list
https://gitlab.com/nvidia/container-images/cuda/-/issues/129 (sebastienmascha, 2022-06-29)

# Description
Using 11.3.1-base-ubuntu20.04, I am not able to run `apt update` because of the mirror entry in `/etc/apt/sources.list.d/cuda.list`, which is: `deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /`.
# To reproduce this error
Start the container:
`docker run --runtime nvidia -it nvidia/cudagl:11.3.1-base-ubuntu20.04`
Or: `docker run -it nvidia/cudagl:11.3.1-base-ubuntu20.04`
Run: `apt update`
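Mirror-sync mismatches like this are usually transient, so simply retrying `apt update` after a short delay often succeeds. A minimal retry-loop sketch (the `retry` helper and the `flaky` stub are illustrative, not part of the image; in the container you would call `retry 5 apt-get update`):

```shell
# Run a command up to N times, sleeping between attempts.
retry() {
    n="$1"; shift
    i=1
    while ! "$@"; do
        [ "$i" -ge "$n" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}

# Demonstration with a stub command that fails twice before succeeding:
count_file="$(mktemp)"
echo 0 > "$count_file"
flaky() {
    c=$(($(cat "$count_file") + 1))
    echo "$c" > "$count_file"
    [ "$c" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $(cat "$count_file") attempts"
```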
# Logs
```
E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/Packages.gz File has unexpected size (343435 != 345084). Mirror sync in progress? [IP: 152.195.19.142 443]
Hashes of expected file:
- Filesize:345084 [weak]
- SHA256:e9af61c4b2f44d714b157c49e40e92f98f50c98ef7d0fa45fc52ddd47947549a
- SHA1:79bf92660b1b0560c26d0d517c78778cad8a51ce [weak]
- MD5Sum:5a2b1f6d478629f661cc4149f9f2a602 [weak]
Release file created at: Tue, 06 Jul 2021 23:02:03 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
```

Issue #192: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 missing /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so
https://gitlab.com/nvidia/container-images/cuda/-/issues/192 (jamesruzewski, 2023-10-13)

nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 is missing the symbolic link /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so to /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so.11.2.
When we run TensorFlow in the Docker image we get:
Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory
If we switch to nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04, TensorFlow runs as normal and we can see that /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so exists.
Also, if we add the symbolic link in our Docker image:
```
cd /usr/local/cuda/targets/x86_64-linux/lib && \
ln -sv libnvrtc.so.11.2 libnvrtc.so
```
TensorFlow will work properly.
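The symlink workaround above can be wrapped in a small guard so it is safe to re-run. A sketch, demonstrated in a scratch directory since in the real image the directory would be /usr/local/cuda/targets/x86_64-linux/lib:

```shell
# Stand-in for the image's CUDA lib directory.
LIBDIR="$(mktemp -d)"
touch "$LIBDIR/libnvrtc.so.11.2"   # stand-in for the real library file

# Create the unversioned symlink only if the target exists and the link doesn't.
if [ -e "$LIBDIR/libnvrtc.so.11.2" ] && [ ! -e "$LIBDIR/libnvrtc.so" ]; then
    ln -sv libnvrtc.so.11.2 "$LIBDIR/libnvrtc.so"
fi
readlink "$LIBDIR/libnvrtc.so"
```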
We had been using nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 with TensorFlow with no issues until the update to the runtime image on Feb 2, 2023.

Issue #179: libnvrtc11.2 does not provide libnvrtc.so
https://gitlab.com/nvidia/container-images/cuda/-/issues/179 (Guillaume Desmottes, 2023-10-11)

Is there any reason why `libnvrtc.so.11.2` is provided by `libnvrtc11.2`, but the symlink pointing to it, which allows actually using the library, is provided by `nvidia-cuda-dev`?
```console
# dpkg -S /usr/lib/x86_64-linux-gnu/libnvrtc.so
nvidia-cuda-dev:amd64: /usr/lib/x86_64-linux-gnu/libnvrtc.so
# dpkg -S /usr/lib/x86_64-linux-gnu/libnvrtc.so.11.2
libnvrtc11.2:amd64: /usr/lib/x86_64-linux-gnu/libnvrtc.so.11.2
```

Issue #183: CUDA `11.x` images cannot work with non-Tesla cards due to the `cuda-compat` package
https://gitlab.com/nvidia/container-images/cuda/-/issues/183 (George Alexopoulos, 2023-07-13)

### Context on `cuda-compat`
See the first paragraph of https://gitlab.com/nvidia/container-images/cuda/-/issues/182.
### Non-Tesla GPU cards (except for some RTX cards) can't run any `11.x` upstream image
Forward compatibility is only supported for Tesla (data center) cards and some RTX cards:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatible-upgrade
> Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select [NGC Server Ready](https://docs.nvidia.com/ngc/ngc-ready-systems/index.html#abstract) SKUs of RTX cards. It’s mainly intended to support applications built on newer CUDA Toolkits to run on systems installed with an older NVIDIA Linux GPU driver from different major release families. This new forward-compatible upgrade path requires the use of a special package called “CUDA compat package”.
See also:
1. https://github.com/NVIDIA/nvidia-docker/issues/1515#issuecomment-872962686
2. https://github.com/NVIDIA/nvidia-docker/issues/1515#issuecomment-872974709
Therefore, CUDA `11.x` images can currently only work with:
- all Tesla cards (only with some specific driver versions, see https://gitlab.com/nvidia/container-images/cuda/-/issues/181)
- Kepler cards (only tags < 11.6)
I have not tested this, but it should hold according to the NVIDIA docs, and I understand your team wants to be pedantic w.r.t. accurately following what the docs support.

Issue #190: Can't apt-get update of nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04
https://gitlab.com/nvidia/container-images/cuda/-/issues/190 (jack-gits, 2023-07-13)

I'm using nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 as the base image, but unfortunately I hit a blocking error when running `apt-get update`.
Below is the error message:
root@48882292c8e8:/# apt-get update
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Err:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
Err:2 http://archive.ubuntu.com/ubuntu jammy InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:3 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64 InRelease [1581 B]
Err:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
Err:3 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64 InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
Get:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [107 kB]
Err:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
Reading package lists... Done
W: http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: GPG error: http://security.ubuntu.com/ubuntu jammy-security InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
E: The repository 'http://security.ubuntu.com/ubuntu jammy-security InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: GPG error: http://archive.ubuntu.com/ubuntu jammy InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
E: The repository 'http://archive.ubuntu.com/ubuntu jammy InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: GPG error: http://archive.ubuntu.com/ubuntu jammy-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
E: The repository 'http://archive.ubuntu.com/ubuntu jammy-updates InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: http://archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: http://archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg are ignored as the file is not readable by user '_apt' executing apt-key.
W: GPG error: http://archive.ubuntu.com/ubuntu jammy-backports InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 871920D1991BC93C
E: The repository 'http://archive.ubuntu.com/ubuntu jammy-backports InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Problem executing scripts APT::Update::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
E: Sub-process returned an error code
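The repeated "not readable by user '_apt'" warnings suggest the Ubuntu keyring files under /etc/apt/trusted.gpg.d/ have lost world-read permission, which would make every repository appear unsigned. A hedged sketch of the permission fix, demonstrated on a scratch copy rather than the live image:

```shell
# Scratch stand-in for /etc/apt/trusted.gpg.d/.
KEYDIR="$(mktemp -d)"
touch "$KEYDIR/ubuntu-keyring-2018-archive.gpg"
chmod 600 "$KEYDIR"/*.gpg   # simulate the broken state (_apt cannot read)

chmod 644 "$KEYDIR"/*.gpg   # the fix: restore world-read permission
stat -c '%a' "$KEYDIR/ubuntu-keyring-2018-archive.gpg"
```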
I'm trying to fix the "NO_PUBKEY" issue, but hit another error, shown below:
root@48882292c8e8:/# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
Executing: /tmp/apt-key-gpghome.RaZONDMNB5/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
gpg: connecting dirmngr at '/tmp/apt-key-gpghome.RaZONDMNB5/S.dirmngr' failed: End of file
gpg: keyserver receive failed: No dirmngr
W: The key(s) in the keyring /etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg are ignored as the file is not readable by user '' executing apt-key.

Issue #202: Missing 'cudnn8' variants for the 12.1.0 images
https://gitlab.com/nvidia/container-images/cuda/-/issues/202 (Yu-Hang "Maxin" Tang, 2023-10-20; milestone: Kitmaker Containers)

Issue #201: `runtime` image misses `libdevice` required for model training in TensorFlow
https://gitlab.com/nvidia/container-images/cuda/-/issues/201 (Gonçalo Figueira, 2023-08-16)

I'm training a model using TensorFlow based on an `nvcr.io/nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04` image. Training (i.e. `model.fit()`) fails during initialisation with the following error:
```
2023-05-24 17:19:20.800561: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-05-24 17:19:20.804495: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
```
I can see that the image includes `libnvidia-nvvm`, but not `libdevice`:
```
/usr/local/cuda-11.8/compat# ll
total 145724
drwxr-xr-x 2 root root 4096 Feb 2 05:19 ./
drwxr-xr-x 1 root root 4096 Feb 2 05:25 ../
lrwxrwxrwx 1 root root 12 Sep 29 2022 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Sep 29 2022 libcuda.so.1 -> libcuda.so.520.61.05
-rw-r--r-- 1 root root 26284256 Sep 29 2022 libcuda.so.520.61.05
lrwxrwxrwx 1 root root 28 Sep 29 2022 libcudadebugger.so.1 -> libcudadebugger.so.520.61.05
-rw-r--r-- 1 root root 10934360 Sep 29 2022 libcudadebugger.so.520.61.05
lrwxrwxrwx 1 root root 19 Sep 29 2022 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root 27 Sep 29 2022 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.520.61.05
-rw-r--r-- 1 root root 92017376 Sep 29 2022 libnvidia-nvvm.so.520.61.05
lrwxrwxrwx 1 root root 37 Sep 29 2022 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.520.61.05
-rw-r--r-- 1 root root 19963864 Sep 29 2022 libnvidia-ptxjitcompiler.so.520.61.05
```
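The listing above can be double-checked by searching the whole CUDA tree for libdevice; XLA looks for `libdevice.10.bc` (the path pattern below is an assumption based on the error message, not a verified layout of this image):

```shell
# Search every CUDA install prefix for a libdevice bitcode file.
# Prints matching paths, or a fallback message when none exist
# (as on machines without a CUDA toolkit installed).
find /usr/local/cuda* -name 'libdevice*.bc' 2>/dev/null | grep . \
    || echo "libdevice not found"
```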
I can work around it by manually installing the `cuda-nvcc` package inside the `runtime` image:
```
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-nvcc-11-8_11.8.89-1_amd64.deb && \
dpkg --ignore-depends=cuda-cudart-dev-11-8 -i cuda-nvcc-11-8_11.8.89-1_amd64.deb
```
Training also works if I use the `devel` image; however, I want to stick with `runtime`, as it's much smaller and I don't need any of the compilation capabilities or dev tools.
Would it make sense to add the missing package to the `runtime` image?
## Specs
Python and TensorFlow are installed inside the Docker image. The following versions were used:
- TensorFlow version: 2.12.0 (installed via `pip`)
- Python version: 3.11.3
- GPU: NVIDIA A10G (24 GB)

Issue #200: NVIDIA_REQUIRE_CUDA was never fixed in 11.5.2 versions
https://gitlab.com/nvidia/container-images/cuda/-/issues/200 (Harry Mallon, 2023-06-23)

`NVIDIA_REQUIRE_CUDA` was fixed to include non-Tesla brands in lots of places (https://gitlab.com/nvidia/container-images/cuda/-/commit/d8b9b5554b4e051d97a9765ff6ff19089782c364) but not in the following places:
* 11.5.1-rockylinux8
* 11.5.2-{centos7, rockylinux8, ubi7, ubi8, ubuntu2004}
```
# was
ENV NVIDIA_REQUIRE_CUDA "cuda>=11.5 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471"
# should be
ENV NVIDIA_REQUIRE_CUDA "cuda>=11.5 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471"
```

Issue #197: devel images don't include nsight
https://gitlab.com/nvidia/container-images/cuda/-/issues/197 (Jim Melton, 2023-07-13)

I'm using `nvidia/cuda:12.1.0-devel-ubi9`, but as near as I can tell, this issue is common to all images.
Running on an A100, `nvprof` reports:
> ======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
> Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
> Refer https://developer.nvidia.com/tools-overview for more details.
Digging through the Nsight marketing (documentation), it claims that Nsight is packaged with the CUDA Toolkit. However, `ncu` is nowhere to be found in this image. Shouldn't a **devel** image include profiling support?

Issue #222: README points to nvidia container runtime now deprecated
https://gitlab.com/nvidia/container-images/cuda/-/issues/222 (javier von der pahlen, 2024-03-13)

Your README states:
"Usage of the CUDA container images requires the [Nvidia Container Runtime](https://github.com/NVIDIA/nvidia-container-runtime)."
Following your own deprecation notice, it should read:
"Usage of the CUDA container images requires the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)."

Issue #217: cuDNN update for cuda 11.8.0-cudnn8-runtime-ubuntu22.04
https://gitlab.com/nvidia/container-images/cuda/-/issues/217 (Petr Chmelař, 2023-12-04)

Hi!
I would like to ask whether there are any plans to update cuDNN in the Ubuntu-based CUDA images such as `11.8.0-cudnn8-runtime-ubuntu22.04`.
The reason I'm asking is that 5 months ago `11.8.0-cudnn8-runtime-ubuntu22.04` was updated, bumping the included cuDNN version from 8.7.0 to 8.9.0.
That is probably OK, but the problem is that 8.9.0 has several issues related to GPU architectures that lead to incorrect results during model inference.
Specifically, we are affected by the issues related to Pascal-based GPUs that were fixed in cuDNN 8.9.2.
There is another similar issue related to Maxwell GPUs (fixed in 8.9.3) that does not affect our environment but could be affecting other users of the CUDA images.
> Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 15 for fused convolutions with bias and ReLU could generate incorrect results on the Pascal GPU architectures for cuDNN 8.9 releases. This issue is now fixed in 8.9.2. Users of cudnnConvolutionBiasActivationForward would have been similarly affected.
https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html#rel-892__section_hgw_fsc_pxb
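A quick way to check whether a given image is in the affected range is a `sort -V` comparison against the fixed release. A sketch; the 8.9.0 value is hard-coded here, but in the image it would come from something like `dpkg -l 'libcudnn8'`:

```shell
INSTALLED=8.9.0   # cuDNN version currently shipped in the image
FIXED=8.9.2       # first release with the Pascal convolution fix

# Under version sort, the older of the two versions sorts first.
oldest=$(printf '%s\n%s\n' "$INSTALLED" "$FIXED" | sort -V | head -n1)
if [ "$oldest" = "$INSTALLED" ] && [ "$INSTALLED" != "$FIXED" ]; then
    echo "cuDNN $INSTALLED predates the $FIXED fix"
else
    echo "cuDNN $INSTALLED is not affected"
fi
```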
Thanks in advance!

Issue #215: Pulling nvcr.io/nvidia/pytorch:23.08-py3 fails with digest error
https://gitlab.com/nvidia/container-images/cuda/-/issues/215 (Michael Taron, 2023-09-13)

Hello!
I couldn't find the source for this container on GitHub or GitLab, so apologies if this is not the right place to report.
If you run `docker pull nvcr.io/nvidia/pytorch:23.08-py3`, it fails with `filesystem layer verification failed for digest sha256:6c9f88339e6283cb72af66d47db4818e17155ef072be34fddd0df91b5305de52`.
This happens both locally and when trying to pull inside a Kubernetes cluster.
Thanks!

Issue #214: How to know required driver version for particular docker image?
https://gitlab.com/nvidia/container-images/cuda/-/issues/214 (MyungHa Kwon, 2023-08-10)

I want to know what driver version is required for a particular Docker image.
Currently, driver version 460.73 is installed; it fails to run 11.5.2 but succeeds with 11.4.3.
It says:
```
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.5, please update your driver to a newer version, or use an earlier cuda container: unknown.
```
I found https://gitlab.com/nvidia/container-images/cuda/-/blob/master/doc/container_tags.pdf but it doesn't give me complete information.
By looking at the doc, I can't figure out why 11.5.2 failed and 11.4.3 didn't.
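The failure can be reasoned about from the image's `NVIDIA_REQUIRE_CUDA` value: the 11.5.x images only accept drivers that natively support CUDA 11.5, or drivers falling in specific compat ranges such as `driver>=470,driver<471`. A simplified sketch of that range check (illustrative only; this is not the actual nvidia-container-cli logic, and the brand fields are ignored here):

```shell
# Compat driver ranges from the 11.5.x images' NVIDIA_REQUIRE_CUDA (brands elided).
REQ="driver>=418,driver<419 driver>=450,driver<451 driver>=470,driver<471"
DRIVER=460   # major version of the installed 460.73 driver

ok=no
for clause in $REQ; do
    lo=$(printf '%s\n' "$clause" | sed -n 's/.*driver>=\([0-9]*\).*/\1/p')
    hi=$(printf '%s\n' "$clause" | sed -n 's/.*driver<\([0-9]*\).*/\1/p')
    if [ "$DRIVER" -ge "$lo" ] && [ "$DRIVER" -lt "$hi" ]; then
        ok=yes
    fi
done
echo "driver $DRIVER satisfies a compat range: $ok"
```

Driver 460 falls in none of the ranges and predates native CUDA 11.5 support, which is consistent with the "unsatisfied condition: cuda>=11.5" error above.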
How can I find the required driver version spec?

Issue #208: NCCL for CUDA 12.2
https://gitlab.com/nvidia/container-images/cuda/-/issues/208 (Jesus Alvarez, 2023-10-20)

The NCCL team has not yet shipped for CUDA 12.2. This will be added once a release is made.

Issue #207: cuDNN for CUDA 12.2
https://gitlab.com/nvidia/container-images/cuda/-/issues/207 (Jesus Alvarez, 2023-10-22)

The cuDNN team has not yet shipped for CUDA 12.2. This will be added once a release is made.

Issue #206: CUDA 12.2 Container Images
https://gitlab.com/nvidia/container-images/cuda/-/issues/206 (Jesus Alvarez, 2023-06-30)

Provide container images for 12.2.

Issue #205: Issues with 11.8.0-devel-ubuntu22.04
https://gitlab.com/nvidia/container-images/cuda/-/issues/205 (Captur AI, 2023-06-20)

Hi!
I have been using "11.8.0-devel-ubuntu22.04" for quite a while and I am able to produce great results with the model I'm training.
In the last two days I am getting worse results; I noticed they started from the moment "putepackagin363" pushed an update to the "11.8.0-devel-ubuntu22.04" tag.
Is there any way you can provide me with the previous digest ID so I could fetch it instead? Or at least share what changes were made?

Issue #199: Cuda Image Samples
https://gitlab.com/nvidia/container-images/cuda/-/issues/199 (Jesus Alvarez, 2023-04-26; milestone: Kitmaker Containers)

Dockerfile samples for CUDA images that can be easily built by users using the build.sh script.

Issue #198: Can't install nvidia-docker in Fedora 37 due to libnvidia-ml.so.1 file
https://gitlab.com/nvidia/container-images/cuda/-/issues/198 (Pedro Couto, 2023-04-24)

### 1. Issue or feature description
Upon running the command `docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi` I get the error:
```
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
```
### 2. Steps to reproduce the issue
I installed the NVIDIA driver by following the Fedora docs, not NVIDIA's. For example, `nvcc --version` outputs an error saying the nvcc command is not recognized, but on my host machine I can run `nvidia-smi`.
The commands I used to install nvidia are the following:
```
sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda
```
As visible in the following image, I am able to run the command `nvidia-smi` on my host machine:
![image](https://user-images.githubusercontent.com/69256195/232839043-0540a7dd-badf-4880-b2c5-f45915bded94.png)
I followed this guide on how to install nvidia-docker - - and did the following:
```
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
##############################
sudo dnf install nvidia-docker2
# Edit /etc/nvidia-container-runtime/config.toml and disable cgroups:
no-cgroups = true
sudo reboot
##############################
sudo systemctl start docker.service
##############################
docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
and upon running this docker command I get the error shown in section 1 above.
The thing is, I have the file that it says is missing (see the following image), so maybe it is looking for it in a different directory?
![image](https://user-images.githubusercontent.com/69256195/232840112-21741908-358f-4d36-ae49-e698d291e465.png)
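Since the library exists on disk but the hook cannot load it, one thing worth checking is whether `libnvidia-ml.so.1` is actually registered in the dynamic linker cache on the host; a file that exists but is absent from the cache can still fail to resolve. (Also note the `docker version` output below shows a Docker Desktop server, which runs the engine inside a VM that does not see host libraries; that alone can explain the failure.) An illustrative check:

```shell
# Ask the dynamic linker cache where libnvidia-ml.so.1 lives.
# On a correctly configured driver host this prints a path such as
# /usr/lib64/libnvidia-ml.so.1; the fallback message prints otherwise.
ldconfig -p 2>/dev/null | grep libnvidia-ml \
    || echo "libnvidia-ml not in ldconfig cache"
```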
### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)
`uname -a`:
```
Linux fedora 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 6 23:30:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
---
`docker version`
```
Client: Docker Engine - Community
Cloud integration: v1.0.31
Version: 23.0.3
API version: 1.41 (downgraded from 1.42)
Go version: go1.19.7
Git commit: 3e7cbfd
Built: Tue Apr 4 22:10:33 2023
OS/Arch: linux/amd64
Context: desktop-linux
Server: Docker Desktop 4.18.0 (104112)
Engine:
Version: 20.10.24
API version: 1.41 (minimum version 1.12)
Go version: go1.19.7
Git commit: 5d6db84
Built: Tue Apr 4 18:18:42 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.18
GitCommit: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
---
`rpm -qa '*nvidia*'`
```
nvidia-gpu-firmware-20230310-148.fc37.noarch
xorg-x11-drv-nvidia-kmodsrc-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.x86_64
nvidia-settings-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-power-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-530.41.03-1.fc37.x86_64
akmod-nvidia-530.41.03-1.fc37.x86_64
kmod-nvidia-6.2.9-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-persistenced-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.i686
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.i686
kmod-nvidia-6.2.10-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-container-toolkit-base-1.13.0-1.x86_64
libnvidia-container1-1.13.0-1.x86_64
libnvidia-container-tools-1.13.0-1.x86_64
nvidia-container-toolkit-1.13.0-1.x86_64
nvidia-docker2-2.13.0-1.noarch
```
---
`nvidia-container-cli -V`
```
cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-18)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
```
Thanks for your help!