freeenergy/coulandvdwsequential_coul fails with separate PME rank and CUDA
Summary
To reproduce, in the tests/freeenergy/coulandvdwsequential_coul directory:
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme gpu -update cpu -ntomp 1 -npme 1
$ ../../../bin/gmx_mpi check -e ../../gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential
#....
Reading energy frame 0 time 0.000
step 0: block[1][ 2] (-3.850891e+01 - -3.934799e+01)
step 0: block[3][ 2] (1.540360e+01 - 1.573909e+01)
step 0: block[5][ 2] (-7.701825e+00 - -7.869583e+00)
Reading energy frame 20 time 0.020
step 20: block[1][ 2] (-5.535483e+01 - -4.876364e+01)
step 20: block[3][ 2] (2.214181e+01 - 1.950552e+01)
step 20: block[5][ 2] (-1.107116e+01 - -9.752747e+00)
Reading energy frame 40 time 0.040
step 40: block[1][ 2] (-5.145402e+01 - -4.861850e+01)
step 40: block[3][ 2] (2.058149e+01 - 1.944754e+01)
step 40: block[5][ 2] (-1.029101e+01 - -9.723885e+00)
Last energy frame read 40 time 0.040
#....
There is a discrepancy already at step 0: for example, block[1][ 2] differs by about 0.84 (-38.51 vs. -39.35), a relative difference of roughly 2%, well above the 0.1% tolerance. Looking at the log, the large discrepancy is in "dVcoul/dl", while the other energy terms match closely.
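If it helps with triage, the offending term can be extracted from both energy files with gmx energy for a direct comparison. A minimal sketch, assuming the term appears under the same name ("dVcoul/dl") in the energy-term selection menu; the output file names here are arbitrary:
$ echo "dVcoul/dl" | ../../../bin/gmx_mpi energy -f ener.edr -o dvdl_test.xvg  # term from the failing run
$ echo "dVcoul/dl" | ../../../bin/gmx_mpi energy -f ../../gromacs-regressiontests-release-2022/freeenergy/coulandvdwsequential_coul/reference_s.edr -o dvdl_ref.xvg  # same term from the reference file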
Initially reported in #4471 (closed) by @gaurav.garg. Reproduced on dev-gpu04 (2xRTX2080Ti) with 9ff5c75c (2022.1), cuda/11.5, gcc/11.2, and /nethome/aland/modules/modulefiles/mpich/4.0.0-cuda11.5.
- Fails with or without setting GMX_ENABLE_DIRECT_GPU_COMM.
- Fails even if we use a single GPU.
- Fails with 2, 3, or 4 ranks.
- Keeping PME on the CPU or using a single MPI rank avoids the issue (see the example commands after this list).
- Unlike #4471 (closed), it is not affected by compute-sanitizer.
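The two workarounds above correspond to variants of the reproduction command along these lines (a sketch; only the -pme setting and the rank count are changed relative to the failing command):
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme cpu -update cpu -ntomp 1 -npme 1  # PME on the CPU, separate PME rank kept
$ mpirun -np 1 ../../../bin/gmx_mpi mdrun -notunepme -nb gpu -pme gpu -update cpu -ntomp 1  # single MPI rank, no separate PME rank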