freeenergy/coulandvdwsequential_coul failed with CUDA-aware MPI and too many ranks
To reproduce, in the `tests/freeenergy/coulandvdwsequential_coul` directory:
```
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu
#....
$ ../../../bin/gmx_mpi check -e /tmp/gromacs/build/cuda-mpi/tests/gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential
#....
Reading energy frame 11 time 0.011 Ryckaert-Bell. step 11: 7.03328, step 11: 7.08752
Reading energy frame 12 time 0.012 Ryckaert-Bell. step 12: 6.71805, step 12: 6.78365
Reading energy frame 13 time 0.013 Ryckaert-Bell. step 13: 6.34876, step 13: 6.4258
Reading energy frame 14 time 0.014 Ryckaert-Bell. step 14: 5.92299, step 14: 6.01035
Reading energy frame 15 time 0.015 Ryckaert-Bell. step 15: 5.43801, step 15: 5.53343
Reading energy frame 16 time 0.016 Ryckaert-Bell. step 16: 4.89448, step 16: 4.99503
Reading energy frame 17 time 0.017 Ryckaert-Bell. step 17: 4.30077, step 17: 4.4022
Reading energy frame 18 time 0.018 Ryckaert-Bell. step 18: 3.6744, step 18: 3.7718
Reading energy frame 19 time 0.019 Angle step 19: 7.92999, step 19: 7.98061
Ryckaert-Bell. step 19: 3.04084, step 19: 3.12947
Reading energy frame 20 time 0.020 Angle step 20: 7.15287, step 20: 7.20557
Ryckaert-Bell. step 20: 2.43051, step 20: 2.50636
step 20: block[1][ 2] (-5.535483e+01 - -5.522291e+01)
step 20: block[2][ 2] (-3.986024e+00 - -3.822755e+00)
step 20: block[3][ 2] (2.214181e+01 - 2.197490e+01)
step 20: block[4][ 2] (0.000000e+00 - -1.563120e-01)
step 20: block[5][ 2] (-1.107116e+01 - -1.122191e+01)
Angle step 21: 6.56793, step 21: 6.62042
Ryckaert-Bell. step 21: 1.87384, step 21: 1.93416
Angle step 22: 6.2521, step 22: 6.30424
Angle step 23: 6.16664, step 23: 6.21986
Angle step 24: 6.18433, step 24: 6.23941
Angle step 25: 6.15306, step 25: 6.20878
Angle step 26: 5.96892, step 26: 6.02226
Reading energy frame 30 time 0.030 Angle step 33: 6.35321, step 33: 6.40435
Ryckaert-Bell. step 33: 2.58422, step 33: 2.64499
Angle step 34: 6.73476, step 34: 6.80456
Ryckaert-Bell. step 34: 3.06988, step 34: 3.15741
Angle step 35: 6.71072, step 35: 6.79293
Ryckaert-Bell. step 35: 3.55621, step 35: 3.67167
Angle step 36: 6.20674, step 36: 6.28998
Ryckaert-Bell. step 36: 4.0175, step 36: 4.15919
Angle step 37: 5.32477, step 37: 5.39605
Ryckaert-Bell. step 37: 4.42449, step 37: 4.58854
Ryckaert-Bell. step 38: 4.7474, step 38: 4.92804
Ryckaert-Bell. step 39: 4.96079, step 39: 5.15095
```
The output varies from run to run, but it is always the Ryckaert-Bellemans and Angle terms that differ.
The following does not change the behavior (see the sketch after the list):
- Changing `-ntomp` between 1, 2, 3, and 4.
- Hiding one of the two GPUs via `CUDA_VISIBLE_DEVICES`.
- `-bonded cpu`.
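For concreteness, a sketch of these variations; the environment variables are the same as in the repro command above and are omitted for brevity, and the invocations are illustrative rather than verbatim shell history:

```
$ mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -ntomp 2                 # also with 1, 3, 4: still fails
$ CUDA_VISIBLE_DEVICES=0 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # one GPU hidden: still fails
$ mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -bonded cpu              # bonded on CPU: still fails
```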
Any one of the following makes the test pass (also sketched below):
- `mpirun -np 6`.
- `-nb cpu`.
- No `GMX_ENABLE_DIRECT_GPU_COMM=1`.
- Running under compute-sanitizer (`compute-sanitizer --target-processes all --tool=synccheck mpirun ...`).
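Correspondingly, the passing runs look roughly like this (again illustrative, with the environment variables written out in full):

```
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 6 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # 6 ranks: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb cpu   # CPU non-bonded: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu                                # no direct GPU comm: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 compute-sanitizer --target-processes all --tool=synccheck mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # under synccheck: passes
```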
The latter makes me suspect it's a synchronization issue. That a large number of ranks is required suggests empty domains (with 8 ranks on this small system, some domain-decomposition cells likely end up with no atoms).
Tested on dev-gpu04 (2x RTX 2080 Ti) with bd810fb4 (`release-2022`), `cuda/11.6.2`, `gcc/11.2`, and `/nethome/aland/modules/modulefiles/mpich/4.0.0-cuda11.5` or `openmpi/1.8.8-cuda6.5`.
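The toolchain setup presumably amounts to something like the following; the `module` commands are my assumption (standard Environment Modules/Lmod usage), and only the module names come from the list above:

```
$ module load cuda/11.6.2 gcc/11.2
$ module use /nethome/aland/modules/modulefiles   # assumed: makes the custom modulefiles visible
$ module load mpich/4.0.0-cuda11.5                # or: module load openmpi/1.8.8-cuda6.5
```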