freeenergy/coulandvdwsequential_coul failed with CUDA-aware MPI and too many ranks
To reproduce, in the `tests/freeenergy/coulandvdwsequential_coul` directory:
```
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu
#....
$ ../../../bin/gmx_mpi check -e /tmp/gromacs/build/cuda-mpi/tests/gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential
#....
Reading energy frame 11 time 0.011 Ryckaert-Bell. step 11: 7.03328, step 11: 7.08752
Reading energy frame 12 time 0.012 Ryckaert-Bell. step 12: 6.71805, step 12: 6.78365
Reading energy frame 13 time 0.013 Ryckaert-Bell. step 13: 6.34876, step 13: 6.4258
Reading energy frame 14 time 0.014 Ryckaert-Bell. step 14: 5.92299, step 14: 6.01035
Reading energy frame 15 time 0.015 Ryckaert-Bell. step 15: 5.43801, step 15: 5.53343
Reading energy frame 16 time 0.016 Ryckaert-Bell. step 16: 4.89448, step 16: 4.99503
Reading energy frame 17 time 0.017 Ryckaert-Bell. step 17: 4.30077, step 17: 4.4022
Reading energy frame 18 time 0.018 Ryckaert-Bell. step 18: 3.6744, step 18: 3.7718
Reading energy frame 19 time 0.019 Angle step 19: 7.92999, step 19: 7.98061
Ryckaert-Bell. step 19: 3.04084, step 19: 3.12947
Reading energy frame 20 time 0.020 Angle step 20: 7.15287, step 20: 7.20557
Ryckaert-Bell. step 20: 2.43051, step 20: 2.50636
step 20: block[1][ 2] (-5.535483e+01 - -5.522291e+01)
step 20: block[2][ 2] (-3.986024e+00 - -3.822755e+00)
step 20: block[3][ 2] (2.214181e+01 - 2.197490e+01)
step 20: block[4][ 2] (0.000000e+00 - -1.563120e-01)
step 20: block[5][ 2] (-1.107116e+01 - -1.122191e+01)
Angle step 21: 6.56793, step 21: 6.62042
Ryckaert-Bell. step 21: 1.87384, step 21: 1.93416
Angle step 22: 6.2521, step 22: 6.30424
Angle step 23: 6.16664, step 23: 6.21986
Angle step 24: 6.18433, step 24: 6.23941
Angle step 25: 6.15306, step 25: 6.20878
Angle step 26: 5.96892, step 26: 6.02226
Reading energy frame 30 time 0.030 Angle step 33: 6.35321, step 33: 6.40435
Ryckaert-Bell. step 33: 2.58422, step 33: 2.64499
Angle step 34: 6.73476, step 34: 6.80456
Ryckaert-Bell. step 34: 3.06988, step 34: 3.15741
Angle step 35: 6.71072, step 35: 6.79293
Ryckaert-Bell. step 35: 3.55621, step 35: 3.67167
Angle step 36: 6.20674, step 36: 6.28998
Ryckaert-Bell. step 36: 4.0175, step 36: 4.15919
Angle step 37: 5.32477, step 37: 5.39605
Ryckaert-Bell. step 37: 4.42449, step 37: 4.58854
Ryckaert-Bell. step 38: 4.7474, step 38: 4.92804
Ryckaert-Bell. step 39: 4.96079, step 39: 5.15095
```
The output varies from run to run, but it is always the Ryckaert-Bellemans and Angle terms that differ.
The following does not change the behavior (see the sketch after the list):
- Changing `-ntomp` between 1, 2, 3, and 4.
- Hiding one of the two GPUs via `CUDA_VISIBLE_DEVICES`.
- `-bonded cpu`.
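For concreteness, a sketch of these variations; the environment variables are the same as in the repro command above and are omitted for brevity, and the invocations are illustrative rather than verbatim shell history:

```
$ mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -ntomp 2                 # also with 1, 3, 4: still fails
$ CUDA_VISIBLE_DEVICES=0 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # one GPU hidden: still fails
$ mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -bonded cpu              # bonded on CPU: still fails
```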
Any one of the following makes the test pass (also sketched below):
- `mpirun -np 6`.
- `-nb cpu`.
- No `GMX_ENABLE_DIRECT_GPU_COMM=1`.
- Running under compute-sanitizer (`compute-sanitizer --target-processes all --tool=synccheck mpirun ...`).
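Correspondingly, the passing runs look roughly like this (again illustrative, with the environment variables written out in full):

```
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 6 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # 6 ranks: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb cpu   # CPU non-bonded: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu                                # no direct GPU comm: passes
$ GMX_FORCE_CUDA_AWARE_MPI=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 compute-sanitizer --target-processes all --tool=synccheck mpirun -np 8 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu   # under synccheck: passes
```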
The latter makes me suspect it's a synchronization issue. That a large number of ranks is required suggests empty domains (with 8 ranks on this small system, some domain-decomposition cells likely end up with no atoms).
Tested on dev-gpu04 (2x RTX 2080 Ti) with bd810fb4 (`release-2022`), `cuda/11.6.2`, `gcc/11.2`, and `/nethome/aland/modules/modulefiles/mpich/4.0.0-cuda11.5` or `openmpi/1.8.8-cuda6.5`.
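The toolchain setup presumably amounts to something like the following; the `module` commands are my assumption (standard Environment Modules/Lmod usage), and only the module names come from the list above:

```
$ module load cuda/11.6.2 gcc/11.2
$ module use /nethome/aland/modules/modulefiles   # assumed: makes the custom modulefiles visible
$ module load mpich/4.0.0-cuda11.5                # or: module load openmpi/1.8.8-cuda6.5
```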