freeenergy/coulandvdwsequential_coul failed with CUDA-aware MPI and too many ranks
To reproduce, in the `tests/freeenergy/coulandvdwsequential_coul` directory:

```
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu
#....
$ ../../../bin/gmx_mpi check -e /tmp/gromacs/build/cuda-mpi/tests/gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential
#....
Reading energy frame 11 time 0.011
Ryckaert-Bell. step 11: 7.03328, step 11: 7.08752
Reading energy frame 12 time 0.012
Ryckaert-Bell. step 12: 6.71805, step 12: 6.78365
Reading energy frame 13 time 0.013
Ryckaert-Bell. step 13: 6.34876, step 13: 6.4258
Reading energy frame 14 time 0.014
Ryckaert-Bell. step 14: 5.92299, step 14: 6.01035
Reading energy frame 15 time 0.015
Ryckaert-Bell. step 15: 5.43801, step 15: 5.53343
Reading energy frame 16 time 0.016
Ryckaert-Bell. step 16: 4.89448, step 16: 4.99503
Reading energy frame 17 time 0.017
Ryckaert-Bell. step 17: 4.30077, step 17: 4.4022
Reading energy frame 18 time 0.018
Ryckaert-Bell. step 18: 3.6744, step 18: 3.7718
Reading energy frame 19 time 0.019
Angle step 19: 7.92999, step 19: 7.98061
Ryckaert-Bell. step 19: 3.04084, step 19: 3.12947
Reading energy frame 20 time 0.020
Angle step 20: 7.15287, step 20: 7.20557
Ryckaert-Bell. step 20: 2.43051, step 20: 2.50636
step 20: block[1][ 2] (-5.535483e+01 - -5.522291e+01)
step 20: block[2][ 2] (-3.986024e+00 - -3.822755e+00)
step 20: block[3][ 2] (2.214181e+01 - 2.197490e+01)
step 20: block[4][ 2] (0.000000e+00 - -1.563120e-01)
step 20: block[5][ 2] (-1.107116e+01 - -1.122191e+01)
Angle step 21: 6.56793, step 21: 6.62042
Ryckaert-Bell. step 21: 1.87384, step 21: 1.93416
Angle step 22: 6.2521, step 22: 6.30424
Angle step 23: 6.16664, step 23: 6.21986
Angle step 24: 6.18433, step 24: 6.23941
Angle step 25: 6.15306, step 25: 6.20878
Angle step 26: 5.96892, step 26: 6.02226
Reading energy frame 30 time 0.030
Angle step 33: 6.35321, step 33: 6.40435
Ryckaert-Bell. step 33: 2.58422, step 33: 2.64499
Angle step 34: 6.73476, step 34: 6.80456
Ryckaert-Bell. step 34: 3.06988, step 34: 3.15741
Angle step 35: 6.71072, step 35: 6.79293
Ryckaert-Bell. step 35: 3.55621, step 35: 3.67167
Angle step 36: 6.20674, step 36: 6.28998
Ryckaert-Bell. step 36: 4.0175, step 36: 4.15919
Angle step 37: 5.32477, step 37: 5.39605
Ryckaert-Bell. step 37: 4.42449, step 37: 4.58854
Ryckaert-Bell. step 38: 4.7474, step 38: 4.92804
Ryckaert-Bell. step 39: 4.96079, step 39: 5.15095
```

The output varies from run to run, but the mismatching terms are always Ryckaert-Bell. and Angle.

The following does **not** change the behavior:

- Changing `-ntomp` between 1, 2, 3, 4.
- Hiding one of the two GPUs via `CUDA_VISIBLE_DEVICES`.
- `-bonded cpu`

Any one of the following makes the test pass:

- `mpirun -np 6`.
- `-nb cpu`
- No `GMX_ENABLE_DIRECT_GPU_COMM=1`
- Running under compute-sanitizer (`compute-sanitizer --target-processes all --tool=synccheck mpirun ...`).

The last of these makes me suspect a synchronization issue. That a large number of ranks is required suggests empty domains are involved.

Tested on `dev-gpu04` (2xRTX2080Ti) with bd810fb449009c6b32ce7afc9d4b7416a5a30cfa (`release-2022`), `cuda/11.6.2`, `gcc/11.2`, and `/nethome/aland/modules/modulefiles/mpich/4.0.0-cuda11.5` or `openmpi/1.8.8-cuda6.5`.
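
Since the output differs between runs, it may help to tally which energy terms mismatch in each `gmx check` log instead of reading the log by eye. Below is a minimal sketch, assuming the mismatch lines have the `<term> step N: <reference>, step N: <new>` shape seen in the log above. The function name `summarize_mismatches` and the relative-difference metric are illustrative choices made here; they are not part of GROMACS and do not reproduce the `-tol`/`-abstol` criterion that `gmx check` applies. Term names containing spaces are not handled.

```python
import re
from collections import defaultdict

# Floating-point number as printed in the log (e.g. 7.03328 or -5.5e+01).
_NUM = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"

# Matches "<term> step N: <ref>, step N: <new>"; assumes single-word term names.
_MISMATCH = re.compile(
    rf"(?P<term>[A-Za-z][\w.\-]*)\s+step\s+(?P<step>\d+):\s+(?P<ref>{_NUM}),"
    rf"\s+step\s+\d+:\s+(?P<new>{_NUM})"
)


def summarize_mismatches(check_output):
    """Return {term: (mismatch_count, max_relative_difference)} for one log."""
    stats = defaultdict(lambda: [0, 0.0])
    for m in _MISMATCH.finditer(check_output):
        ref, new = float(m["ref"]), float(m["new"])
        denom = max(abs(ref), abs(new))
        rel = abs(new - ref) / denom if denom else 0.0
        entry = stats[m["term"]]
        entry[0] += 1
        entry[1] = max(entry[1], rel)
    return {term: (count, rel) for term, (count, rel) in stats.items()}


# A few lines copied from the log in this report.
sample = """\
Reading energy frame 19 time 0.019
Angle step 19: 7.92999, step 19: 7.98061
Ryckaert-Bell. step 19: 3.04084, step 19: 3.12947
Angle step 20: 7.15287, step 20: 7.20557
"""

for term, (count, rel) in sorted(summarize_mismatches(sample).items()):
    print(f"{term}: {count} mismatching frame(s), max relative difference {rel:.3f}")
```

Running this over the saved output of several repeated runs would show whether any term other than Ryckaert-Bell. and Angle ever appears.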
issue