"Assertion failed" error in GROMACS 2023.1 when the CUDA Graphs feature and the gmx mdrun "-bonded gpu" argument are enabled at the same time
Summary
Hi,
An "Assertion failed" error occurs when I enable the CUDA Graphs feature and pass the "-bonded gpu" argument to gmx mdrun at the same time in GROMACS 2023.1.
Here are my GROMACS environment settings:
#Gromacs
export GMX_GPU_DD_COMMS=true
export GMX_CUDA_GRAPH=true
export GMX_GPU_PME_DECOMPOSITION=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
And here is the error message:
Program: gmx mdrun, version 2023.1
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function: DeviceStream::synchronize() const::<lambda()>
Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #400
(cudaErrorInvalidResourceHandle): invalid resource handle.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
GROMACS version
GROMACS version: 2023.1
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc-11 GNU 11.3.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/g++-11 GNU 11.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /opt/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Mon_Apr__3_17:16:06_PDT_2023;Cuda compilation tools, release 12.1, V12.1.105;Build cuda_12.1.r12.1/compiler.32688072_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 12.10
Steps to reproduce
First, enable GMX_CUDA_GRAPH with the shell command:
export GMX_CUDA_GRAPH=true
Then run the simulation with gmx mdrun. Here is my command:
gmx mdrun -v -deffnm Produtcion -s Production.tpr -ntomp 12 -pin on -ntmpi 1 -update gpu -bonded gpu
An "Assertion failed" error occurs. Here is the full output:
:-) GROMACS - gmx mdrun, 2023.1 (-:
Executable: /home/yangzichen/Software/GMX-2023.1/bin/gmx
Data prefix: /home/yangzichen/Software/GMX-2023.1
Working dir: /home/yangzichen/Documents/ZZH/MD
Command line:
gmx mdrun -v -deffnm Produtcion -s Production.tpr -ntomp 12 -pin on -ntmpi 1 -update gpu -bonded gpu
Back Off! I just backed up Produtcion.log to ./#Produtcion.log.1#
Reading file Production.tpr, VERSION 2023.1 (single precision)
GMX_CUDA_GRAPH environment variable is detected. The experimental CUDA Graphs feature will be used if run conditions allow.
Update groups can not be used for this system because atoms that are (in)directly constrained together are interdispersed with other atoms
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
CUDA Graphs will be used, provided there are no CPU force computations.
Using 1 MPI thread
Using 12 OpenMP threads
Back Off! I just backed up Produtcion.xtc to ./#Produtcion.xtc.1#
Back Off! I just backed up Produtcion.edr to ./#Produtcion.edr.1#
starting mdrun '01-JUL-22 C3G and BSA in 0.154M NaCl running 1ns in water'
1000000 steps, 2000.0 ps.
step 400
-------------------------------------------------------
Program: gmx mdrun, version 2023.1
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function: DeviceStream::synchronize() const::<lambda()>
Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #400
(cudaErrorInvalidResourceHandle): invalid resource handle.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
I also ran the simulation with the same environment settings and mdrun command on another computer, using GROMACS 2023.1 compiled with CUDA 11.7 and GCC 11.2. The assertion failure still occurs.
Then I commented out "GMX_CUDA_GRAPH=true" and ran the same command:
gmx mdrun -v -deffnm Produtcion -s Production.tpr -ntomp 12 -pin on -ntmpi 1 -update gpu -bonded gpu
It worked, which confirms that GMX_CUDA_GRAPH and "-bonded gpu" conflict.
Possible fixes
Do NOT enable the CUDA Graphs feature and the gmx mdrun "-bonded gpu" argument at the same time.
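Until this is fixed, a small guard in the launch script could catch the combination before mdrun starts. This is only a minimal sketch: gmx_graph_bonded_conflict is a hypothetical helper name, not part of GROMACS, and it assumes the feature is switched on whenever GMX_CUDA_GRAPH is set (whatever its value), as the "environment variable is detected" log line suggests.

```shell
# Hypothetical helper (not part of GROMACS): succeeds when the combination
# reported here is requested, i.e. GMX_CUDA_GRAPH is set in the environment
# and the mdrun arguments contain "-bonded gpu".
gmx_graph_bonded_conflict() {
    # Assumption: the feature is on whenever the variable is set, whatever its value.
    [ -n "${GMX_CUDA_GRAPH+set}" ] || return 1
    prev=""
    for arg in "$@"; do
        if [ "$prev" = "-bonded" ] && [ "$arg" = "gpu" ]; then
            return 0
        fi
        prev="$arg"
    done
    return 1
}

# Example: drop the CUDA Graphs setting before launching a run with these options.
if gmx_graph_bonded_conflict -update gpu -bonded gpu; then
    echo "GMX_CUDA_GRAPH and -bonded gpu both requested; unsetting GMX_CUDA_GRAPH" >&2
    unset GMX_CUDA_GRAPH
fi
```

After the guard, the mdrun command can be launched unchanged. It then runs without CUDA Graphs, which is the configuration that worked for me.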