GPU detection handling and related MPI should move to a higher level
The following discussion from !447 (merged) should be addressed:
- @erik.lindahl started a discussion: (+10 comments) I anyway think we should move all the MPI-dependent stuff here to higher-level modules, so the hardware module is only about low-level detection of individual nodes, while the decisions that might require knowledge of simulation setups, etc., are handled elsewhere.
- Elsewhere in the thread I suggested moving detection handling (and thus related MPI) up to somewhere around SimulationContext, so that mdrunner starts with const resources ready to use.
- Decide which module should do the MPI communication associated with hardware detection
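To make the "const resources ready to use" idea concrete, here is a minimal sketch of how ownership could be layered. Every name in it (HardwareSummary, detectLocalHardware, ContextSketch, RunnerSketch) is hypothetical and invented for illustration; none of them is the existing GROMACS API, and the detected values are placeholders.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-node detection result (not gmx_hw_info_t).
struct HardwareSummary
{
    int                      numLogicalCores = 0;
    std::vector<std::string> deviceNames;
};

// Hypothetical low-level detection: node-local, no MPI, no simulation knowledge.
inline HardwareSummary detectLocalHardware()
{
    HardwareSummary hw;
    // Placeholder values; real detection would query the CPU and the GPU SDKs.
    hw.numLogicalCores = 8;
    hw.deviceNames     = { "gpu0", "gpu1" };
    return hw;
}

// Hypothetical higher-level owner, playing the role suggested for SimulationContext.
// Any cross-rank exchange would happen at this level, before the result is frozen.
class ContextSketch
{
public:
    explicit ContextSketch(HardwareSummary hw) :
        hardware_(std::make_shared<const HardwareSummary>(std::move(hw)))
    {
    }
    // Consumers only ever receive a const view of the hardware description.
    std::shared_ptr<const HardwareSummary> hardware() const { return hardware_; }

private:
    std::shared_ptr<const HardwareSummary> hardware_;
};

// Hypothetical runner that starts with resources that are already final.
class RunnerSketch
{
public:
    explicit RunnerSketch(std::shared_ptr<const HardwareSummary> hw) : hardware_(std::move(hw))
    {
        assert(hardware_ != nullptr);
    }
    int numDevices() const { return static_cast<int>(hardware_->deviceNames.size()); }

private:
    std::shared_ptr<const HardwareSummary> hardware_;
};
```

In this arrangement the runner cannot trigger detection or communication itself; it can only read what the higher-level context prepared.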
Follow-up tasks observed during the work:
- Detection should determine a set of compatible GPUs and provide a set of indices to them, task assignment should assign those indices, and those indices should be used for building high-level managers like DeviceStreamManager (rather than DeviceInformation)
- DeviceInformation should not also carry the device handle for SYCL
- Hardware detection should be able to be done on only one rank per node (for CPU, and all GPU SDKs), with the gathered information transferred via PhysicalNodeCommunicator to the other ranks on the node
- Hardware detection should stop using MPI_COMM_WORLD and instead use libraryCommWorld (#4457 (closed))
- Hardware printing should use simulationCommunicator, not MPI_COMM_WORLD (see also #4457 (closed))
- DeviceInfoList should either grow into a class with copy semantics or become something copyable like vector<variant>
- gmx_hw_info_t can become trivially copyable once DeviceInfoList can be copied, which smooths several aspects
- The lifetime of CUDA and MPI resources may be coupled (see #3952), and this aspect should also be considered
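
Two of the follow-up tasks above (a copyable device list, and passing device indices rather than DeviceInformation) could look roughly like the following C++17 sketch. All types and functions here are hypothetical illustrations, the per-SDK fields are invented, and the compatibility threshold is arbitrary; this is not the current DeviceInfoList or task-assignment code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical per-SDK records; the real DeviceInformation layout differs.
struct CudaDeviceSketch
{
    std::string name;
    int         computeCapabilityMajor;
};
struct SyclDeviceSketch
{
    std::string name;
    bool        supportsRequiredSubGroupSize;
};

// A plain vector of variants is copyable without any hand-written copy code.
using DeviceRecord     = std::variant<CudaDeviceSketch, SyclDeviceSketch>;
using DeviceListSketch = std::vector<DeviceRecord>;

// Visitor with one compatibility rule per SDK (thresholds are illustrative only).
struct IsCompatible
{
    bool operator()(const CudaDeviceSketch& d) const { return d.computeCapabilityMajor >= 5; }
    bool operator()(const SyclDeviceSketch& d) const { return d.supportsRequiredSubGroupSize; }
};

// Detection reports indices into the device list, not the records themselves.
inline std::vector<std::size_t> findCompatibleDevices(const DeviceListSketch& devices)
{
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < devices.size(); ++i)
    {
        if (std::visit(IsCompatible{}, devices[i]))
        {
            indices.push_back(i);
        }
    }
    return indices;
}

// Task assignment hands out those indices round-robin, one per task.
inline std::vector<std::size_t> assignDevicesToTasks(const std::vector<std::size_t>& compatible,
                                                     std::size_t                     numTasks)
{
    assert(!compatible.empty());
    std::vector<std::size_t> assignment;
    assignment.reserve(numTasks);
    for (std::size_t task = 0; task < numTasks; ++task)
    {
        assignment.push_back(compatible[task % compatible.size()]);
    }
    return assignment;
}
```

A manager such as DeviceStreamManager would then be built from an assigned index plus read-only access to the list, so the list itself never has to be handed around by mutable reference.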
Edited by M. Eric Irrgang