GPU direct communications - Redmine #2915
This family of developments enables direct communication between GPU memory spaces. This issue tracks and discusses common sub-issues related to the development of these new features.
TODO
- Investigate whether the inter-GPU synchronization in the GPU halo exchange can be made more lightweight in the thread-MPI case with cuStreamWaitValue32 (or similar). Done with an MPI exchange of pointers to events, with the remote event enqueued to the stream; a sketch of this scheme follows below.
- A CUDA memops-based solution is still desirable, but it currently requires a kernel module parameter and is therefore not available by default (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html)
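The event-exchange scheme noted above can be illustrated with a minimal sketch, assuming thread-MPI so that all ranks share one address space and a raw `cudaEvent_t` handle stays valid on the peer rank; the function and parameter names (`syncStreamWithPeer`, `haloStream`, `sendRank`, `recvRank`) are illustrative, not the actual GROMACS API.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// Synchronize a halo-exchange stream with a peer rank by exchanging event
// handles over MPI and enqueueing the remote event on the local stream.
void syncStreamWithPeer(cudaStream_t haloStream, int sendRank, int recvRank, MPI_Comm comm)
{
    // Record an event on the local stream once the halo data is ready.
    // (Event destruction is omitted for brevity.)
    cudaEvent_t localEvent;
    cudaEventCreateWithFlags(&localEvent, cudaEventDisableTiming);
    cudaEventRecord(localEvent, haloStream);

    // Exchange the event handles; with thread-MPI these are plain pointers
    // that remain valid across ranks of the same process.
    cudaEvent_t remoteEvent;
    const int   handleBytes = static_cast<int>(sizeof(cudaEvent_t));
    MPI_Sendrecv(&localEvent, handleBytes, MPI_BYTE, sendRank, 0,
                 &remoteEvent, handleBytes, MPI_BYTE, recvRank, 0,
                 comm, MPI_STATUS_IGNORE);

    // Make subsequent work on the local stream wait for the peer's event
    // without blocking the host thread.
    cudaStreamWaitEvent(haloStream, remoteEvent, 0);
}
```

Because the wait is enqueued on the stream rather than performed by the host, no host-side blocking is required; the cuStreamWaitValue32 route would additionally avoid event creation and exchange, but depends on the CUDA memops availability discussed above.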
(from redmine: issue id 2915, created on 2019-04-02 by alangray3)
- Relations:
- relates #2891 (closed)
- child #3082 (closed)
- child #3087 (closed)
- parent #3370 (closed)
- Changesets:
- Revision 8f42be1e by Alan Gray on 2019-08-16T09:29:54Z:
GPU halo exchange
Activate with GMX_GPU_DD_COMMS environment variable.
Class to initialize and apply halo exchange functionality directly on
GPU memory space.
Fully operational for position buffer. Functionality also present for
force buffer, but not yet called (awaiting acceptance of force buffer
ops patches).
Data transfer for halo exchange is wrapped and has two implementations:
CUDA-aware MPI (default with "real" MPI) and direct CUDA memcpy (default
with thread MPI). With the latter, the P2P path is taken if the hardware
supports it; otherwise D2H, H2D staging is used (a sketch of this path
selection follows the changeset).
Limitation: still supports only 1D data decomposition.
TODO: implement support for multiple decomposition dimensions.
TODO: integrate the call to force buffer halo exchange once the force
buffer ops patches are accepted.
Implements part of #2890
Associated with #2915
Change-Id: I8e6473481ad4d943df78d7019681bfa821bd5798
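The copy-path selection described in the changeset above can be illustrated with a minimal sketch; the function and buffer names are hypothetical, and the host staging buffer is assumed to be pinned (e.g. allocated with cudaMallocHost) so the asynchronous copies do not serialize.

```cpp
#include <cuda_runtime.h>

// Copy halo data between two GPUs: direct P2P if the hardware supports it,
// otherwise staged device-to-host, host-to-device copies.
void haloCopy(void* d_recvBuf, int recvDev,
              const void* d_sendBuf, int sendDev,
              void* h_staging, size_t bytes, cudaStream_t stream)
{
    int canAccessPeer = 0;
    cudaDeviceCanAccessPeer(&canAccessPeer, recvDev, sendDev);

    if (canAccessPeer)
    {
        // P2P path: one direct copy between the two GPU memory spaces.
        cudaMemcpyPeerAsync(d_recvBuf, recvDev, d_sendBuf, sendDev, bytes, stream);
    }
    else
    {
        // Fallback path: D2H followed by H2D through a pinned host buffer.
        cudaMemcpyAsync(h_staging, d_sendBuf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaMemcpyAsync(d_recvBuf, h_staging, bytes, cudaMemcpyHostToDevice, stream);
    }
}
```

Note that cudaMemcpyPeerAsync itself transparently stages through the host when direct peer access is unavailable; the explicit capability check here simply mirrors the two distinct paths described in the commit message.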
- Revision 44f607d7 by Alan Gray on 2019-09-16T13:12:49Z:
GPU halo exchange
Activate with the GMX_GPU_DD_COMMS and GMX_USE_GPU_BUFFER_OPS
environment variables (an activation sketch follows below).
Class to initialize and apply coordinate buffer halo exchange
functionality directly on GPU memory space.
Currently supports only direct CUDA memcpy and relies on thread MPI
being in use.
Updated gpucomm testing matrices to cover the non-GPU case.
Limitation: still supports only thread MPI, 1D data decomposition, and
coordinate halo exchange.
Implements part of #2890
Associated with #2915
Change-Id: I8e6473481ad4d943df78d7019681bfa821bd5798
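As a hedged sketch of the activation described above (the real GROMACS wiring differs; `envFlagSet` is a hypothetical helper), gating the feature on the two environment variables could look like:

```cpp
#include <cstdlib>

// True when the named environment variable is set (to any value).
static bool envFlagSet(const char* name)
{
    return std::getenv(name) != nullptr;
}

// The GPU coordinate halo exchange requires both flags, since it builds on
// the GPU buffer ops.
const bool useGpuHaloExchange =
        envFlagSet("GMX_GPU_DD_COMMS") && envFlagSet("GMX_USE_GPU_BUFFER_OPS");
```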