complex/nst_mismatch fails with hipSYCL, GPU-aware MPI, 2 ranks, -nb gpu -pme cpu -update cpu
Summary
Seen in the weekly CI; can be reproduced quite reliably when running manually in the same Docker image (registry.gitlab.com/gromacs/gromacs/ci-ubuntu-22.04-llvm-15-cuda-11.7.1-hipsycl-0.9.4-rocm-5.3.3:latest) on our infrastructure with a single GPU.
$ /usr/local/cmake-3.18.4/bin/cmake ../ -DGMX_GPU=SYCL -DGMX_MPI=ON -DGMX_SIMD=AVX2_256 \
-DCMAKE_BUILD_TYPE=RelWithAssert -DGMX_SYCL=ACPP -DGMX_SYCL_HIPSYCL=ON \
-DHIPSYCL_TARGETS='cuda:sm_50,sm_52,sm_60,sm_61,sm_70,sm_75' -DGMX_GPU_FFT_LIBRARY=VkFFT \
-DCMAKE_C_COMPILER_LAUNCHER= -DCMAKE_CXX_COMPILER_LAUNCHER= \
-DCMAKE_C_COMPILER=clang-15 -DCMAKE_CXX_COMPILER=clang++-15 \
-DGMX_USE_NVTX=ON -DGMX_CYCLE_SUBCOUNTERS=ON \
-DNVTX_INCLUDE_DIR=/usr/local/cuda-11.7/targets/x86_64-linux/include/ \
-DNVTX_LIBRARY=/usr/local/cuda-11.7/targets/x86_64-linux/lib/libnvToolsExt.so
...
$ cd tests/complex/nst_mismatch
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_MAX_CACHED_NODES=0 \
mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu
The energies match well on steps 0 and 14, but on steps 28 and 30 the "Coul. recip." energy (and, as a consequence, the total energy) fluctuates from run to run.
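Not part of the original report, but since the failure is intermittent, a driver loop along these lines can be used to quantify the run-to-run fluctuation; the energy-term name passed to gmx energy ("Coul.-recip.") and the default ener.edr output name are assumptions that may need adjusting for this test case:
# Hypothetical helper loop: repeat the reproducer and pull out the reciprocal-space
# Coulomb energy from each repeat to see how much it varies between runs.
$ for i in $(seq 1 10); do
    GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_MAX_CACHED_NODES=0 \
    mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu
    # Term name as listed by 'gmx energy' (spaces replaced by dashes); adjust if it differs.
    echo "Coul.-recip." | mpirun -np 1 ../../../bin/gmx_mpi energy -f ener.edr -o coulrecip_$i.xvg
    # The last line of the .xvg holds the final-step value that fluctuates between runs.
    tail -n 1 coulrecip_$i.xvg
  done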
Observations:
- Setting `CUDA_LAUNCH_BLOCKING=1` fixes the problem (see the command sketches after this list).
- Up to (and including) step 26, the input coordinates for the PME calculation are nearly identical (only a few differences of ~1e-4 absolute), and so are the per-rank long-range energies (differences up to 1e-3 absolute); on step 27, the coordinate differences become much more numerous and larger (up to 4e-4), and the per-rank long-range energies start differing by ~1e0 (absolute).
- Disabling GPU-aware MPI / GPU-direct communication fixes the problem.
- Running under `compute-sanitizer --tool=(sync|init|mem)check` or `valgrind` fixes the problem; `racecheck` hangs.
- The issue reproduces fine when running under nvprof (no nsys in the image).
- Two ranks recorded with the energy not matching the reference: fail_143399.nvprof, fail_143401.nvprof
- Re-ran the same build until the energy matched the reference: good_143348.nvprof, good_143350.nvprof
- Changing `HIPSYCL_MAX_CACHED_NODES` does not seem to affect the behavior.
- Setting `-dlb no` does not seem to affect the behavior.
- Does not reproduce with native CUDA in the same container.
- Reproduces with hipSYCL 0.9.4 and 23.10.0.
- Setting `HIPSYCL_ALLOW_INSTANT_SUBMISSION` fixes the issue (supported in 23.10.0 only).
- Reproduced in the same Docker image on a different system (with an sm_86 GPU).
- Changing `nstlist` from 13 to 5/8/10/12/14/15/17/26 fixes the problem; setting it to 11, 9, or 7 still exhibits the bug. Other `nst*` variables do not seem to have a large effect.
- Synchronizing with the halo stream after (but not before) submitting the indexMap H2D copy in `reinitHalo` seems to help.
- Something is wrong with the F-halo exchange on step 26 (an NS step): the pre-integration forces are way off.
- Host forces are lost, sometimes on one rank, sometimes on both.
- Disabling `StepWorkload.useGpuFBufferOps` on NS steps helps.
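For reference, hedged command-line sketches of the environment-variable workarounds listed above; the exact value expected for `HIPSYCL_ALLOW_INSTANT_SUBMISSION` is an assumption based on the hipSYCL documentation:
# Workaround A: serialize CUDA launches (fixes the problem in the reproducer above)
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 CUDA_LAUNCH_BLOCKING=1 \
  mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu

# Workaround B (hipSYCL 23.10.0 only): instant submission; value assumed to be 1
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_ALLOW_INSTANT_SUBMISSION=1 \
  mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu

# Workaround C: leave GMX_ENABLE_DIRECT_GPU_COMM unset so GPU-direct communication stays disabled
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu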
Affected commits: