freeenergy/coulandvdwsequential_coul fails with separate PME rank and CUDA
Summary
To reproduce, in the tests/freeenergy/coulandvdwsequential_coul directory:
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme gpu -update cpu -ntomp 1 -npme 1
$ ../../../bin/gmx_mpi check -e ../../gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential
#....
Reading energy frame 0 time 0.000
step 0: block[1][ 2] (-3.850891e+01 - -3.934799e+01)
step 0: block[3][ 2] (1.540360e+01 - 1.573909e+01)
step 0: block[5][ 2] (-7.701825e+00 - -7.869583e+00)
Reading energy frame 20 time 0.020
step 20: block[1][ 2] (-5.535483e+01 - -4.876364e+01)
step 20: block[3][ 2] (2.214181e+01 - 1.950552e+01)
step 20: block[5][ 2] (-1.107116e+01 - -9.752747e+00)
Reading energy frame 40 time 0.040
step 40: block[1][ 2] (-5.145402e+01 - -4.861850e+01)
step 40: block[3][ 2] (2.058149e+01 - 1.944754e+01)
step 40: block[5][ 2] (-1.029101e+01 - -9.723885e+00)
Last energy frame read 40 time 0.040
#....
There is a discrepancy already at step 0: for example, block[1][ 2] differs by about 0.84 (-38.51 vs. -39.35), a relative difference of roughly 2%, well above the 0.1% tolerance. Looking at the log, the large discrepancy is in "dVcoul/dl", while the other energy terms match closely.
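If it helps with triage, the offending term can be extracted from both energy files with gmx energy for a direct comparison. A minimal sketch, assuming the term appears under the same name ("dVcoul/dl") in the energy-term selection menu; the output file names here are arbitrary:
$ echo "dVcoul/dl" | ../../../bin/gmx_mpi energy -f ener.edr -o dvdl_test.xvg  # term from the failing run
$ echo "dVcoul/dl" | ../../../bin/gmx_mpi energy -f ../../gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -o dvdl_ref.xvg  # same term from the reference file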
Initially reported in #4471 (closed) by @gaurav.garg. Reproduced on dev-gpu04 (2xRTX2080Ti) with 9ff5c75c (2022.1), cuda/11.5, gcc/11.2, and /nethome/aland/modules/modulefiles/mpich/4.0.0-cuda11.5.
- Fails with or without setting GMX_ENABLE_DIRECT_GPU_COMM.
- Fails even if we use a single GPU.
- Fails with 2, 3, or 4 ranks.
- Keeping PME on the CPU or using a single MPI rank avoids the issue (see the example commands after this list).
- Unlike #4471 (closed), it is not affected by compute-sanitizer.
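The two workarounds above correspond to variants of the reproduction command along these lines (a sketch; only the -pme setting and the rank count are changed relative to the failing command):
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme cpu -update cpu -ntomp 1 -npme 1  # PME on the CPU, separate PME rank kept
$ mpirun -np 1 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme gpu -update cpu -ntomp 1  # single MPI rank, no separate PME rank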