Crash with CUDA Graphs on multi-GPU
Summary
Since !3538 (merged), GROMACS crashes with cudaErrorStreamCaptureInvalidated when CUDA Graphs are enabled together with domain decomposition. The cause is an extra sync introduced in !3516 (merged), which breaks graph capture. The issue is present in v2023.1. The sync isn't required when graphs are enabled, because in that case existing logic explicitly includes the zeroing within the graph.
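For anyone unfamiliar with the failure mode, here is a minimal standalone sketch (not GROMACS code) of how a sync issued during stream capture invalidates the capture. The function name `captureWithOptionalSync`, the buffer size, and the choice of `cudaStreamSynchronize` as the offending call are assumptions for illustration only; the actual sync call in GROMACS may differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Captures a buffer zeroing into a CUDA graph and optionally issues a
// blocking sync in the middle of the capture.
// Returns the status reported by cudaStreamEndCapture.
cudaError_t captureWithOptionalSync(bool syncDuringCapture)
{
    const size_t numBytes = 1024 * sizeof(float);

    cudaStream_t stream = nullptr;
    cudaStreamCreate(&stream);

    // Allocate before capture starts: allocation is not a capturable operation.
    float* d_buffer = nullptr;
    cudaMalloc(&d_buffer, numBytes);

    cudaGraph_t graph = nullptr;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Asynchronous work on the capturing stream is recorded as a graph node.
    cudaMemsetAsync(d_buffer, 0, numBytes, stream);

    if (syncDuringCapture)
    {
        // Synchronizing a capturing stream is not permitted; the call
        // returns an error and the capture is invalidated.
        cudaStreamSynchronize(stream);
    }

    cudaError_t endStatus = cudaStreamEndCapture(stream, &graph);

    if (graph != nullptr)
    {
        cudaGraphDestroy(graph);
    }
    cudaGetLastError(); // reset the (non-sticky) error state before cleanup
    cudaFree(d_buffer);
    cudaStreamDestroy(stream);
    return endStatus;
}

int main()
{
    printf("capture without sync: %s\n", cudaGetErrorName(captureWithOptionalSync(false)));
    printf("capture with sync:    %s\n", cudaGetErrorName(captureWithOptionalSync(true)));
    return 0;
}
```

Under the CUDA stream-capture rules, the first call should end the capture successfully, while the second should report cudaErrorStreamCaptureInvalidated from `cudaStreamEndCapture`, which matches the error seen in the crash.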
Exact steps to reproduce
Run a GPU-resident case with GMX_CUDA_GRAPH=1 and GMX_ENABLE_DIRECT_GPU_COMM=1 on multiple GPUs with a separate PME rank.
For developers: Why is this important?
GROMACS shouldn't crash.
Possible fixes
Avoid the extra sync when graphs are in use. Fix incoming.
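The idea can be sketched as follows. This is not the actual GROMACS patch: `zeroBuffer` and `captureAndLaunch` are hypothetical names, and detecting capture via `cudaStreamIsCapturing` is an assumption made to keep the example standalone (GROMACS may instead check its own graph-enabled flag). `cudaGraphInstantiateWithFlags` is used because its signature is the same in CUDA 11.4 and later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a GROMACS function): zero a device buffer and
// wait for completion only when the stream is NOT being captured.
cudaError_t zeroBuffer(float* d_buffer, size_t count, cudaStream_t stream)
{
    cudaError_t status = cudaMemsetAsync(d_buffer, 0, count * sizeof(float), stream);
    if (status != cudaSuccess)
    {
        return status;
    }

    cudaStreamCaptureStatus captureState = cudaStreamCaptureStatusNone;
    status = cudaStreamIsCapturing(stream, &captureState);
    if (status != cudaSuccess)
    {
        return status;
    }

    if (captureState == cudaStreamCaptureStatusNone)
    {
        // Eager path: no graph is being recorded, so a blocking sync is legal.
        status = cudaStreamSynchronize(stream);
    }
    // Capture path: the memset is already a graph node, so the sync is skipped.
    return status;
}

// Captures zeroBuffer() into a graph, then instantiates and launches it once.
// Returns the first error encountered, or cudaSuccess.
cudaError_t captureAndLaunch()
{
    const size_t    count     = 1024;
    cudaStream_t    stream    = nullptr;
    float*          d_buffer  = nullptr;
    cudaGraph_t     graph     = nullptr;
    cudaGraphExec_t graphExec = nullptr;

    cudaError_t status = cudaStreamCreate(&stream);
    if (status == cudaSuccess) status = cudaMalloc(&d_buffer, count * sizeof(float));
    if (status == cudaSuccess) status = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    if (status == cudaSuccess)
    {
        cudaError_t helperStatus = zeroBuffer(d_buffer, count, stream);
        // Always end the capture so the stream returns to normal mode.
        status = cudaStreamEndCapture(stream, &graph);
        if (status == cudaSuccess) status = helperStatus;
    }
    if (status == cudaSuccess) status = cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
    if (status == cudaSuccess) status = cudaGraphLaunch(graphExec, stream);
    // Outside of capture, synchronizing the stream is allowed again.
    if (status == cudaSuccess) status = cudaStreamSynchronize(stream);

    if (graphExec != nullptr) cudaGraphExecDestroy(graphExec);
    if (graph != nullptr) cudaGraphDestroy(graph);
    cudaFree(d_buffer);
    if (stream != nullptr) cudaStreamDestroy(stream);
    return status;
}

int main()
{
    printf("guarded capture and launch: %s\n", cudaGetErrorName(captureAndLaunch()));
    return 0;
}
```

Calling `zeroBuffer` on a stream that is not capturing keeps the original blocking behaviour, so the non-graph code path is unchanged in this sketch.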