complex/nst_mismatch fails with hipSYCL, GPU-aware MPI, 2 ranks, -nb gpu -pme cpu -update cpu
Summary
Seen in the weekly CI; can be reproduced quite reliably when running manually in the same Docker image (registry.gitlab.com/gromacs/gromacs/ci-ubuntu-22.04-llvm-15-cuda-11.7.1-hipsycl-0.9.4-rocm-5.3.3:latest) on our infrastructure with a single GPU.
$ /usr/local/cmake-3.18.4/bin/cmake ../ -DGMX_GPU=SYCL -DGMX_MPI=ON -DGMX_SIMD=AVX2_256 \
-DCMAKE_BUILD_TYPE=RelWithAssert -DGMX_SYCL=ACPP -DGMX_SYCL_HIPSYCL=ON \
-DHIPSYCL_TARGETS='cuda:sm_50,sm_52,sm_60,sm_61,sm_70,sm_75' -DGMX_GPU_FFT_LIBRARY=VkFFT \
-DCMAKE_C_COMPILER_LAUNCHER= -DCMAKE_CXX_COMPILER_LAUNCHER= \
-DCMAKE_C_COMPILER=clang-15 -DCMAKE_CXX_COMPILER=clang++-15 \
-DGMX_USE_NVTX=ON -DGMX_CYCLE_SUBCOUNTERS=ON \
-DNVTX_INCLUDE_DIR=/usr/local/cuda-11.7/targets/x86_64-linux/include/ \
-DNVTX_LIBRARY=/usr/local/cuda-11.7/targets/x86_64-linux/lib/libnvToolsExt.so
...
$ cd tests/complex/nst_mismatch
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_MAX_CACHED_NODES=0 \
mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu
The energies match well on steps 0 and 14, but on steps 28 and 30 the "Coul. recip." energy (and, as a consequence, the total energy) fluctuates from run to run.
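Not part of the original report, but since the failure is intermittent, a driver loop along these lines can be used to quantify the run-to-run fluctuation; the energy-term name passed to gmx energy ("Coul.-recip.") and the default ener.edr output name are assumptions that may need adjusting for this test case:
# Hypothetical helper loop: repeat the reproducer and pull out the reciprocal-space
# Coulomb energy from each repeat to see how much it varies between runs.
$ for i in $(seq 1 10); do
    GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_MAX_CACHED_NODES=0 \
    mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu
    # Term name as listed by 'gmx energy' (spaces replaced by dashes); adjust if it differs.
    echo "Coul.-recip." | mpirun -np 1 ../../../bin/gmx_mpi energy -f ener.edr -o coulrecip_$i.xvg
    # The last line of the .xvg holds the final-step value that fluctuates between runs.
    tail -n 1 coulrecip_$i.xvg
  done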
Observations:
- Setting `CUDA_LAUNCH_BLOCKING=1` fixes the problem (see the command sketches after this list).
- Up to (and including) step 26, the input coordinates for the PME calculation are nearly identical (only a few differences of ~1e-4 absolute), and so are the per-rank long-range energies (differences up to 1e-3 absolute); on step 27, the coordinate differences become much more numerous and larger (up to 4e-4), and the per-rank long-range energies start differing by ~1e0 (absolute).
- Disabling GPU-aware MPI / GPU-direct communication fixes the problem.
- Running under `compute-sanitizer --tool=(sync|init|mem)check` or `valgrind` fixes the problem; `racecheck` hangs.
- The issue reproduces fine when running under nvprof (no nsys in the image).
- Two ranks recorded with the energy not matching the reference: fail_143399.nvprof, fail_143401.nvprof
- Re-ran the same build until the energy matched the reference: good_143348.nvprof, good_143350.nvprof
- Changing `HIPSYCL_MAX_CACHED_NODES` does not seem to affect the behavior.
- Setting `-dlb no` does not seem to affect the behavior.
- Does not reproduce with native CUDA in the same container.
- Reproduces with hipSYCL 0.9.4 and 23.10.0.
- Setting `HIPSYCL_ALLOW_INSTANT_SUBMISSION` fixes the issue (supported in 23.10.0 only).
- Reproduced in the same Docker image on a different system (with an sm_86 GPU).
- Changing `nstlist` from 13 to 5/8/10/12/14/15/17/26 fixes the problem; setting it to 11, 9, or 7 still exhibits the bug. Other `nst*` variables do not seem to have a large effect.
- Synchronizing with the halo stream after (but not before) submitting the indexMap H2D copy in `reinitHalo` seems to help.
- Something is wrong with the F-halo exchange on step 26 (an NS step): the pre-integration forces are way off.
- Host forces are lost, sometimes on one rank, sometimes on both.
- Disabling `StepWorkload.useGpuFBufferOps` on NS steps helps.
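For reference, hedged command-line sketches of the environment-variable workarounds listed above; the exact value expected for `HIPSYCL_ALLOW_INSTANT_SUBMISSION` is an assumption based on the hipSYCL documentation:
# Workaround A: serialize CUDA launches (fixes the problem in the reproducer above)
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 CUDA_LAUNCH_BLOCKING=1 \
  mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu

# Workaround B (hipSYCL 23.10.0 only): instant submission; value assumed to be 1
$ GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1 HIPSYCL_ALLOW_INSTANT_SUBMISSION=1 \
  mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu

# Workaround C: leave GMX_ENABLE_DIRECT_GPU_COMM unset so GPU-direct communication stays disabled
$ mpirun -np 2 ../../../bin/gmx_mpi mdrun -notunepme -npme 0 -ntomp 2 -pme cpu -update cpu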
Affected commits: