I have some follow-up thoughts related to the code organization in the PP-PME communication. I wanted to discuss these separately instead of adding more comments to !37 (closed).
The implementation made the choice to unify the direct-communication codepaths, both for halo exchange and for PP-PME. For the former this does make sense, given that there are some auxiliary steps and GPU-halo-exchange-specific data structures involved. For the latter, however, it makes less sense, as both the x and f send/receive, when done with lib-MPI, are essentially the same call, just with GPU pointers passed to MPI.
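To make that concrete, here is a minimal sketch (not the actual GROMACS code; the function, buffer, rank and tag names are placeholders) of why the lib-MPI transfer needs no GPU-specific call when the MPI library is CUDA-aware:

```cpp
// Minimal sketch, not the actual GROMACS code: with a CUDA-aware lib-MPI the
// coordinate/force send is the same call as on the CPU path, the only
// difference being that the buffer pointer refers to device memory.
#include <mpi.h>

// d_buf: device pointer (placeholder name); count: number of floats;
// peer: destination PP/PME rank; tag: distinguishes x from f traffic.
void sendGpuBufferToPeer(const float* d_buf, int count, int peer, int tag,
                         MPI_Comm comm, MPI_Request* req)
{
    // Identical to the CPU-path call; a CUDA-aware MPI resolves d_buf internally.
    MPI_Isend(d_buf, count, MPI_FLOAT, peer, tag, comm, req);
}
```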
Because !37 (closed) unifies the GPU communication tMPI/lib-MPI code-paths, which largely differ in implementation, we are giving up the opportunity for code reuse and in some cases duplicating code, which is not ideal code organization. Some examples of this are below.
What did not help is that the methods on the PME side were named in a way that does not reflect what they actually do; e.g. launchReceiveCoordinatesFromPp does not reflect well the task it actually accomplishes: it only receives the events associated with the P2P copies, not the coordinates themselves.
My impression is that this specific code-path would be better off refactored to use a launchReceiveCoordinatesSynchronizers() on the P2P codepath and to merge the CUDA-aware and regular MPI code-paths, as those two are nearly identical.
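Roughly what I have in mind (names and signatures are illustrative only, not a proposal for the actual API):

```cpp
// Rough sketch of the proposed split; names and signatures are illustrative,
// not the actual GROMACS API.
#include <mpi.h>

// P2P (thread-MPI) path: no data transfer here, only consume the
// GpuEventSynchronizers the PP ranks recorded for their P2P copies, e.g. in a
// launchReceiveCoordinatesSynchronizers() method.

// lib-MPI path, shared by the CUDA-aware and regular-MPI builds: the call is
// identical in both cases, only the buffer pointer differs (device vs. host).
void launchReceiveCoordinatesMpi(void* recvBuf, int numBytes, int ppRank, int tag,
                                 MPI_Comm comm, MPI_Request* req)
{
    MPI_Irecv(recvBuf, numBytes, MPI_BYTE, ppRank, tag, comm, req);
}
```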
In pmeForceSenderGpu->sendFToPp() there is also code duplication: the sendCount_ counter essentially replicates the message counter we already have in sendFToPp().
On the PP side there seem to be fewer issues, although in places we do struggle with class-interface consistency vs. uniform MPI behavior (see !37 (comment 519673553)).
I would suggest we finish !37 (closed), then discuss how to organize the code better and plan improvements as follow-up.
The main reason I started with this design was to provide a single interface for GPU-direct communication. However, as we have seen, keeping the interface the same for both implementations is not that easy.
IMO, we should consider making thread-MPI CUDA-aware (#3464 (closed)), which would let us remove the current cudaMemcpy-based implementation. This would also make it easier to implement new features such as PME decomposition with thread-MPI.
We should make sure we maintain the advantages of the fully asynchronous communication the tMPI+P2P setup allows. This is to an extent inherently incompatible with the lib-MPI implementation, which cannot be as GPU-aware as our own implementation (event and stream interop).
Hence, I think that at least in the short to mid term the tMPI and lib-MPI communication setups will have significantly different characteristics.
Admittedly, if we assume that we will only use the CPU codepath for recording the schedule and will execute it with a graph/DAG API, having a communication scheme expressed in two-sided MPI message-passing style is less of a concern (e.g. posting the receive of the PME forces on the PP side late, just before the reduction, is not ideal, but if this is only used to build the DAG, execution will not be affected). However, we need to be careful not to restrict the lib-MPI implementation by this.
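For illustration, this is the record-then-execute idea I mean, sketched with CUDA Graphs; this is not what the current code does, and enqueueStepWork() is a placeholder:

```cpp
#include <cuda_runtime.h>

// Sketch only: record one MD step's work into a CUDA graph, then execute the
// whole step as a DAG. enqueueStepWork() stands for whatever the CPU code-path
// would normally enqueue (kernels, copies, event waits).
void runStepAsGraph(cudaStream_t stream, void (*enqueueStepWork)(cudaStream_t))
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Recording: the CPU path only builds the DAG here; when it posts the
    // individual operations does not determine when they execute.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    enqueueStepWork(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0); // pre-CUDA-12 signature

    // Execution: launch the captured step; repeat on subsequent steps.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}
```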
I would also prefer that we do not bake in such assumptions as a first approach (as it only works if every task within a regular MD step is offloaded, and in addition it is not portable).
Hence, I have the feeling that we should explore ways to treat the tMPI/P2P path separately and unify more of the CPU MPI path with the CUDA-aware MPI path, which have very similar needs. Code-wise this means that where the implementations share little, we should separate them and use helper classes for the few things that both code-paths need, like GpuEventSynchronizers.
If we wanted to converge the GPU-aware implementations instead, I think we should look into persistent communication with non-blocking calls (or perhaps SHMEM); one-sided MPI is likely still not the ideal choice.
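For reference, this is the persistent-request pattern I mean (a sketch only; buffer, rank and step-count names are placeholders, and with a CUDA-aware MPI the buffer could just as well be device memory):

```cpp
// Sketch of MPI persistent communication for the repeated per-step exchange.
#include <mpi.h>

void persistentSendExample(const float* sendBuf, int count, int peer,
                           MPI_Comm comm, int numSteps)
{
    MPI_Request req;
    // Set the message up once...
    MPI_Send_init(sendBuf, count, MPI_FLOAT, peer, /*tag=*/0, comm, &req);

    for (int step = 0; step < numSteps; ++step)
    {
        // ...then only (re)start and complete it every MD step.
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```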
For future record, here is some additional code duplication we could avoid if we merged the lib-MPI codepaths: PmeCoordinateReceiverGpu uses its own MPI request array and a counter incremented in launchReceiveCoordinatesFromPp(), and relies on an MPI_Isend identical to the one on the CPU path.
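A hypothetical shared helper (illustrative only, not existing GROMACS code) that both the CPU and the CUDA-aware lib-MPI receive paths could use, instead of each keeping its own request array and counter, could look like this:

```cpp
// Hypothetical helper owning the request bookkeeping for one step's receives.
#include <mpi.h>
#include <vector>

class MpiRequestTracker
{
public:
    // Post a non-blocking receive and track its request; buf may be host memory
    // or, with a CUDA-aware MPI, device memory.
    void postIrecv(void* buf, int numBytes, int source, int tag, MPI_Comm comm)
    {
        requests_.emplace_back();
        MPI_Irecv(buf, numBytes, MPI_BYTE, source, tag, comm, &requests_.back());
    }
    // Complete all outstanding receives for this step.
    void waitAll()
    {
        MPI_Waitall(static_cast<int>(requests_.size()), requests_.data(), MPI_STATUSES_IGNORE);
        requests_.clear();
    }

private:
    std::vector<MPI_Request> requests_;
};
```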
I suggest we now close this issue, since the functionality is in place. It is blocked by #3950 (closed), but that is already blocking #2915 as well, so we don't lose anything by taking this one out of the chain.