Race condition on GPU halo exchange scheduling events in extreme corner case runs
Summary
Under extreme run conditions (e.g. a high number of thread-MPI tasks running on each GPU), the ranks can drift so far out of sync that there is a race condition on the event used to schedule synchronization of the X and F GPU halo exchanges, causing a segfault in cudaStreamWaitEvent.
More explicitly,
- a given rank can mark its halo event and send it to its lower neighbor
- the rank then receives the corresponding event from its upper neighbor and proceeds to scheduling its F halo exchange
- it sends the same event (reused for both the X and F halo exchanges) to its upper neighbor
- but the event has not yet been processed by the lower neighbor, which is far behind (see the sketch below).
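The following is a minimal, single-process sketch of the hazardous pattern (hypothetical names, not actual GROMACS code): one event per rank is reused for both the X and the F halo exchange, so a rank that runs ahead can re-record the event for the F exchange before its lower neighbor has issued the cudaStreamWaitEvent for the X exchange. In this simplified form the intended ordering is silently lost; in the real multi-threaded case the concurrent reuse is what leads to the observed crash in cudaStreamWaitEvent.

```cpp
// Sketch of the race-prone pattern: a single cudaEvent_t reused for X and F.
// Names (RankState, mailbox, ...) are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

struct RankState {
    cudaStream_t stream;
    cudaEvent_t  haloEvent;  // one event reused for X and F -> race prone
};

// Producer side: mark completion of the local halo work and hand the event
// handle to the neighbor (with thread-MPI the handle is shared between threads).
static void markAndPublish(RankState& me, cudaEvent_t* mailbox)
{
    cudaEventRecord(me.haloEvent, me.stream);
    *mailbox = me.haloEvent;  // "send" the handle to the neighbor
}

// Consumer side: make the local stream wait on the neighbor's event.
static void waitOnNeighbour(RankState& me, const cudaEvent_t* mailbox)
{
    // If the neighbor has meanwhile re-recorded the same event for its F
    // exchange, this no longer waits on the X work it was meant to wait on.
    cudaStreamWaitEvent(me.stream, *mailbox, 0);
}

int main()
{
    RankState upper{}, lower{};
    cudaStreamCreate(&upper.stream);
    cudaStreamCreate(&lower.stream);
    cudaEventCreateWithFlags(&upper.haloEvent, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&lower.haloEvent, cudaEventDisableTiming);

    cudaEvent_t mailbox = nullptr;  // stands in for the MPI message

    // X halo exchange: the upper rank publishes its event for the lower rank...
    markAndPublish(upper, &mailbox);

    // ...but the lower rank is far behind, and the upper rank already reuses
    // the SAME event for its F halo exchange.
    markAndPublish(upper, &mailbox);

    // When the lower rank finally waits, the X-exchange ordering is gone.
    waitOnNeighbour(lower, &mailbox);

    cudaStreamSynchronize(lower.stream);
    printf("done (X-exchange ordering guarantee was lost)\n");
    return 0;
}
```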
Exact steps to reproduce
So far I've only reproduced it with 8 thread-MPI tasks running on 1 or 2 GPUs, with CUDA graphs enabled (for both STMV and ADH Dodec). Note that the issue is not directly related to CUDA Graphs (it happens on non-graph steps), but the use of graphs seems to push the ranks further out of sync.
For developers: Why is this important?
GROMACS shouldn't crash.
Possible fixes
Use separate events for scheduling the X and F GPU halo exchanges. Fix incoming.
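As a rough illustration of the proposed direction (hypothetical names, not the actual GROMACS change), each rank would carry distinct events for the X and the F halo exchange, so re-recording for F can never overtake a neighbor's pending wait on the X event:

```cpp
// Sketch, assuming separate per-rank events for X and F halo scheduling.
#include <cuda_runtime.h>

struct HaloEvents {
    cudaEvent_t xReady;  // marked when X halo data is ready
    cudaEvent_t fReady;  // marked when F halo data is ready
};

static void haloEventsInit(HaloEvents* ev)
{
    cudaEventCreateWithFlags(&ev->xReady, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&ev->fReady, cudaEventDisableTiming);
}

// X exchange: mark and wait on the dedicated X event only.
static void scheduleXHalo(cudaStream_t localStream, cudaStream_t neighbourStream, HaloEvents* ev)
{
    cudaEventRecord(ev->xReady, localStream);
    cudaStreamWaitEvent(neighbourStream, ev->xReady, 0);
}

// F exchange: uses its own event, so it cannot clobber a pending X wait.
static void scheduleFHalo(cudaStream_t localStream, cudaStream_t neighbourStream, HaloEvents* ev)
{
    cudaEventRecord(ev->fReady, localStream);
    cudaStreamWaitEvent(neighbourStream, ev->fReady, 0);
}

int main()
{
    // Two streams stand in for two ranks in this single-process sketch.
    cudaStream_t upper, lower;
    cudaStreamCreate(&upper);
    cudaStreamCreate(&lower);

    HaloEvents ev;
    haloEventsInit(&ev);

    scheduleXHalo(upper, lower, &ev);  // X ordering kept on its own event
    scheduleFHalo(upper, lower, &ev);  // F reuse cannot disturb the X wait

    cudaStreamSynchronize(lower);
    return 0;
}
```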