I have some follow-up thoughts related to the code organization in the PP-PME communication. I wanted to discuss these separately instead of adding more comments to !37 (closed).
The implementation made the choice to unify the direct-communication codepaths, both for halo exchange and for PP-PME. For the former this does make sense, given that there are some auxiliary steps and GPU-halo-exchange-specific data structures involved. For the latter, however, it makes less sense, as both the x and f send/receive, when done with lib-MPI, are essentially the same call, just with GPU pointers passed to MPI.
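To make that concrete, here is a minimal sketch (not the actual GROMACS code; the function, buffer, rank and tag names are placeholders) of why the lib-MPI transfer needs no GPU-specific call when the MPI library is CUDA-aware:

```cpp
// Minimal sketch, not the actual GROMACS code: with a CUDA-aware lib-MPI the
// coordinate/force send is the same call as on the CPU path, the only
// difference being that the buffer pointer refers to device memory.
#include <mpi.h>

// d_buf: device pointer (placeholder name); count: number of floats;
// peer: destination PP/PME rank; tag: distinguishes x from f traffic.
void sendGpuBufferToPeer(const float* d_buf, int count, int peer, int tag,
                         MPI_Comm comm, MPI_Request* req)
{
    // Identical to the CPU-path call; a CUDA-aware MPI resolves d_buf internally.
    MPI_Isend(d_buf, count, MPI_FLOAT, peer, tag, comm, req);
}
```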
Because !37 (closed) unifies the GPU communication tMPI/lib-MPI code-paths, which largely differ in implementation, we are giving up the opportunity for code reuse and in some cases duplicating code, which is not ideal code organization. Some examples of this are below.
What did not help is that the methods on the PME side were named in a way that does not reflect what they actually do; e.g. launchReceiveCoordinatesFromPp does not reflect well the task it actually accomplishes: it only receives the events associated with the P2P copies, not the coordinates themselves.
My impression is that this specific code-path would be better off refactored to use a launchReceiveCoordinatesSynchronizers() on the P2P codepath and to merge the CUDA-aware and regular MPI code-paths, as those two are nearly identical.
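Roughly what I have in mind (names and signatures are illustrative only, not a proposal for the actual API):

```cpp
// Rough sketch of the proposed split; names and signatures are illustrative,
// not the actual GROMACS API.
#include <mpi.h>

// P2P (thread-MPI) path: no data transfer here, only consume the
// GpuEventSynchronizers the PP ranks recorded for their P2P copies, e.g. in a
// launchReceiveCoordinatesSynchronizers() method.

// lib-MPI path, shared by the CUDA-aware and regular-MPI builds: the call is
// identical in both cases, only the buffer pointer differs (device vs. host).
void launchReceiveCoordinatesMpi(void* recvBuf, int numBytes, int ppRank, int tag,
                                 MPI_Comm comm, MPI_Request* req)
{
    MPI_Irecv(recvBuf, numBytes, MPI_BYTE, ppRank, tag, comm, req);
}
```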
In pmeForceSenderGpu->sendFToPp() there is also code duplication: the sendCount_ counter essentially replicates the message counter we already have in sendFToPp().
On the PP side there seem to be fewer issues, although in places we do struggle with class-interface consistency vs. uniform MPI behavior (see !37 (comment 519673553)).
I would suggest we finish !37 (closed), then discuss how to organize the code better and plan improvements as follow-up.
The main reason I started with this design was to provide a single interface for GPU-direct communication. However, as we have seen, keeping the interface the same for both implementations is not that easy.
IMO, we should consider making thread-MPI CUDA-aware (#3464 (closed)), which would let us remove the current cudaMemcpy-based implementation. This would also make it easier to implement new features such as PME decomposition with thread-MPI.
We should make sure we maintain the advantages of the fully asynchronous communication the tMPI+P2P setup allows. This is to an extent inherently incompatible with the lib-MPI implementation, which cannot be as GPU-aware as our own implementation (event and stream interop).
Hence, I think that at least in the short to mid term the tMPI and lib-MPI communication setups will have significantly different characteristics.
Admittedly, if we assume that we will only use the CPU codepath for recording the schedule and will execute it with a graph/DAG API, having a communication scheme expressed in two-sided MPI message-passing style is less of a concern (e.g. posting the receive of the PME forces on the PP side late, just before the reduction, is not ideal, but if this is only used to build the DAG, execution will not be affected). However, we need to be careful not to restrict the lib-MPI implementation by this.
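For illustration, this is the record-then-execute idea I mean, sketched with CUDA Graphs; this is not what the current code does, and enqueueStepWork() is a placeholder:

```cpp
#include <cuda_runtime.h>

// Sketch only: record one MD step's work into a CUDA graph, then execute the
// whole step as a DAG. enqueueStepWork() stands for whatever the CPU code-path
// would normally enqueue (kernels, copies, event waits).
void runStepAsGraph(cudaStream_t stream, void (*enqueueStepWork)(cudaStream_t))
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Recording: the CPU path only builds the DAG here; when it posts the
    // individual operations does not determine when they execute.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    enqueueStepWork(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0); // pre-CUDA-12 signature

    // Execution: launch the captured step; repeat on subsequent steps.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}
```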
I would also prefer that we do not bake in such assumptions as a first approach (as it only works if every task within a regular MD step is offloaded, and in addition it is not portable).
Hence, I have the feeling that we should explore ways to treat the tMPI/P2P path separately and unify more of the CPU MPI path with the CUDA-aware MPI path, which have very similar needs. Code-wise this means that where the implementations share little, we should separate them and use helper classes for the few things that both code-paths need, like GpuEventSynchronizers.
If we wanted to converge the GPU-aware implementations instead, I think we should look into persistent communication with non-blocking calls (or perhaps SHMEM); one-sided MPI is likely still not the ideal choice.
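For reference, this is the persistent-request pattern I mean (a sketch only; buffer, rank and step-count names are placeholders, and with a CUDA-aware MPI the buffer could just as well be device memory):

```cpp
// Sketch of MPI persistent communication for the repeated per-step exchange.
#include <mpi.h>

void persistentSendExample(const float* sendBuf, int count, int peer,
                           MPI_Comm comm, int numSteps)
{
    MPI_Request req;
    // Set the message up once...
    MPI_Send_init(sendBuf, count, MPI_FLOAT, peer, /*tag=*/0, comm, &req);

    for (int step = 0; step < numSteps; ++step)
    {
        // ...then only (re)start and complete it every MD step.
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```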
For future record, here is some additional code duplication we could avoid if we merged the lib-MPI codepaths: PmeCoordinateReceiverGpu uses its own MPI request array and a counter incremented in launchReceiveCoordinatesFromPp(), and relies on an MPI_Isend identical to the one on the CPU path.
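A hypothetical shared helper (illustrative only, not existing GROMACS code) that both the CPU and the CUDA-aware lib-MPI receive paths could use, instead of each keeping its own request array and counter, could look like this:

```cpp
// Hypothetical helper owning the request bookkeeping for one step's receives.
#include <mpi.h>
#include <vector>

class MpiRequestTracker
{
public:
    // Post a non-blocking receive and track its request; buf may be host memory
    // or, with a CUDA-aware MPI, device memory.
    void postIrecv(void* buf, int numBytes, int source, int tag, MPI_Comm comm)
    {
        requests_.emplace_back();
        MPI_Irecv(buf, numBytes, MPI_BYTE, source, tag, comm, &requests_.back());
    }
    // Complete all outstanding receives for this step.
    void waitAll()
    {
        MPI_Waitall(static_cast<int>(requests_.size()), requests_.data(), MPI_STATUSES_IGNORE);
        requests_.clear();
    }

private:
    std::vector<MPI_Request> requests_;
};
```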
I suggest we now close this issue, since the functionality is in place. It is blocked by #3950 (closed), but that is already blocking #2915 as well, so we don't lose anything by taking this one out of the chain.