Race condition on GPU halo exchange scheduling events in extreme corner case runs
Summary
Under extreme run conditions (e.g. a high number of thread-MPI tasks running on each GPU), the ranks can drift so far out of sync that there is a race condition on the event used to schedule synchronization of the X and F GPU halo exchanges, causing a segfault in cudaStreamWaitEvent.
More explicitly,
- a given rank can mark its halo event and send it to its lower neighbor
- the rank then receives the corresponding event from its upper neighbor and proceeds to scheduling its F halo exchange
- it sends the same event (reused for both the X and F halo exchanges) to its upper neighbor
- but the event has not yet been processed by the lower neighbor, which is far behind (see the sketch below).
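The following is a minimal, single-process sketch of the hazardous pattern (hypothetical names, not actual GROMACS code): one event per rank is reused for both the X and the F halo exchange, so a rank that runs ahead can re-record the event for the F exchange before its lower neighbor has issued the cudaStreamWaitEvent for the X exchange. In this simplified form the intended ordering is silently lost; in the real multi-threaded case the concurrent reuse is what leads to the observed crash in cudaStreamWaitEvent.

```cpp
// Sketch of the race-prone pattern: a single cudaEvent_t reused for X and F.
// Names (RankState, mailbox, ...) are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

struct RankState {
    cudaStream_t stream;
    cudaEvent_t  haloEvent;  // one event reused for X and F -> race prone
};

// Producer side: mark completion of the local halo work and hand the event
// handle to the neighbor (with thread-MPI the handle is shared between threads).
static void markAndPublish(RankState& me, cudaEvent_t* mailbox)
{
    cudaEventRecord(me.haloEvent, me.stream);
    *mailbox = me.haloEvent;  // "send" the handle to the neighbor
}

// Consumer side: make the local stream wait on the neighbor's event.
static void waitOnNeighbour(RankState& me, const cudaEvent_t* mailbox)
{
    // If the neighbor has meanwhile re-recorded the same event for its F
    // exchange, this no longer waits on the X work it was meant to wait on.
    cudaStreamWaitEvent(me.stream, *mailbox, 0);
}

int main()
{
    RankState upper{}, lower{};
    cudaStreamCreate(&upper.stream);
    cudaStreamCreate(&lower.stream);
    cudaEventCreateWithFlags(&upper.haloEvent, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&lower.haloEvent, cudaEventDisableTiming);

    cudaEvent_t mailbox = nullptr;  // stands in for the MPI message

    // X halo exchange: the upper rank publishes its event for the lower rank...
    markAndPublish(upper, &mailbox);

    // ...but the lower rank is far behind, and the upper rank already reuses
    // the SAME event for its F halo exchange.
    markAndPublish(upper, &mailbox);

    // When the lower rank finally waits, the X-exchange ordering is gone.
    waitOnNeighbour(lower, &mailbox);

    cudaStreamSynchronize(lower.stream);
    printf("done (X-exchange ordering guarantee was lost)\n");
    return 0;
}
```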
Exact steps to reproduce
So far I've only reproduced it with 8 thread-MPI tasks running on 1 or 2 GPUs, with CUDA graphs enabled (for both STMV and ADH Dodec). Note that the issue is not directly related to CUDA Graphs (it happens on non-graph steps), but the use of graphs seems to push the ranks further out of sync.
For developers: Why is this important?
GROMACS shouldn't crash.
Possible fixes
Use separate events for scheduling the X and F GPU halo exchanges. Fix incoming.
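As a rough illustration of the proposed direction (hypothetical names, not the actual GROMACS change), each rank would carry distinct events for the X and the F halo exchange, so re-recording for F can never overtake a neighbor's pending wait on the X event:

```cpp
// Sketch, assuming separate per-rank events for X and F halo scheduling.
#include <cuda_runtime.h>

struct HaloEvents {
    cudaEvent_t xReady;  // marked when X halo data is ready
    cudaEvent_t fReady;  // marked when F halo data is ready
};

static void haloEventsInit(HaloEvents* ev)
{
    cudaEventCreateWithFlags(&ev->xReady, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&ev->fReady, cudaEventDisableTiming);
}

// X exchange: mark and wait on the dedicated X event only.
static void scheduleXHalo(cudaStream_t localStream, cudaStream_t neighbourStream, HaloEvents* ev)
{
    cudaEventRecord(ev->xReady, localStream);
    cudaStreamWaitEvent(neighbourStream, ev->xReady, 0);
}

// F exchange: uses its own event, so it cannot clobber a pending X wait.
static void scheduleFHalo(cudaStream_t localStream, cudaStream_t neighbourStream, HaloEvents* ev)
{
    cudaEventRecord(ev->fReady, localStream);
    cudaStreamWaitEvent(neighbourStream, ev->fReady, 0);
}

int main()
{
    // Two streams stand in for two ranks in this single-process sketch.
    cudaStream_t upper, lower;
    cudaStreamCreate(&upper);
    cudaStreamCreate(&lower);

    HaloEvents ev;
    haloEventsInit(&ev);

    scheduleXHalo(upper, lower, &ev);  // X ordering kept on its own event
    scheduleFHalo(upper, lower, &ev);  // F reuse cannot disturb the X wait

    cudaStreamSynchronize(lower);
    return 0;
}
```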