# MPI_COMM_WORLD remediation
## Summary
Several `MPI_Comm` values are still hard-coded to MPI_COMM_WORLD, which may conflict with usage under the #3644 (closed) feature.
We need to evaluate the remaining occurrences of MPI_COMM_WORLD and perform appropriate remediation where necessary.
## For developers: Why is this important?
For several years, we have been trying to make libgromacs friendly to API use cases and external execution management. To this end, we need to allow client software to take responsibility for global resource management, including MPI initialization and finalization, and potentially to subdivide available resources (such as by providing an MPI subcommunicator instead of letting libgromacs assume ownership of MPI_COMM_WORLD).
The 2021 release included the C++ interface for passing the `MPI_Comm` from the client, but remaining occurrences of MPI_COMM_WORLD cause confusion and brittle code, and in some cases prevent proper behavior under use cases that would seem to be supported by #3644 (closed).
This issue describes a set of preliminary work for #4422 (closed). There is some overlap with #3688 (closed), but unlike #4422 (closed), #3688 (closed) was not explicitly targeted for 2022Q2.
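To make the target scenario concrete, here is a minimal sketch of the client-side pattern that #3644 (closed) is meant to support: the client owns MPI initialization and finalization and hands libgromacs only a subcommunicator. `run_simulation_on` is a hypothetical placeholder for the real library entry point, not an existing GROMACS symbol.

```cpp
#include <mpi.h>

// Hypothetical client: initializes MPI itself and gives the library only a
// subcommunicator, retaining the odd-numbered ranks for other work.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    const int color = worldRank % 2; // even ranks: simulation; odd ranks: client work
    MPI_Comm subComm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &subComm);

    if (color == 0)
    {
        // run_simulation_on(subComm); // hypothetical entry point: every
        // collective inside must use subComm; a leftover MPI_COMM_WORLD
        // collective would wait forever on the odd ranks.
    }

    MPI_Comm_free(&subComm);
    MPI_Finalize();
    return 0;
}
```

Each item below is a place where a hard-coded MPI_COMM_WORLD can break this pattern, typically by hanging in a collective that the excluded ranks never enter, or by finalizing resources the client still owns.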
## Details
- `gmx_fatal_collective` (`gmxlib/network.h`) may determine that `MPI_Finalize` should be called if the communicator is MPI_COMM_WORLD, and the call to `gmx_exit_on_fatal_error` checks only `gmx_mpi_initialized()` (not the initialization depth from `gmx::init()`). We should be able to mitigate by `dup`ing MPI_COMM_WORLD in API use cases (see the first sketch after this list), but we should check.
- `gmx_collect_hardware_mpi()` (from `gmx_detect_hardware()`) is hard-coded to reduce hardware detection results on MPI_COMM_WORLD. This is a reminder that clients providing an MPI communicator should also specify other constraints on detectable resources that should be considered usable (ref #3650, #3952). Of more immediate concern, the communicator for the `MPI_Allreduce` needs to be provided (see the second sketch after this list), or simulation setup will hang at hardware detection if less than the full set of processes is allocated for simulation.
- `gmx_print_detected_hardware()` (in its call to `detected_hardware_string()`) gets its rank information from MPI_COMM_WORLD. This shouldn't hang, but it could give misleading output if the allocation is subdivided for multiple simulations.
- `sparse_parallel_eigensolver` uses MPI_COMM_WORLD for `MPI_Comm_size` and `MPI_Comm_rank`, but doesn't seem to use the results for anything. This function is used by `sparse_eigensolver()` if `MPI_Comm_size` on MPI_COMM_WORLD gives more than 1 and `GMX_MPI_NOT` is defined. However, I don't see where `GMX_MPI_NOT` would be defined.
- `Mdrunner::cloneOnSpawnedThread` uses MPI_COMM_WORLD instead of the prototype instance's `libraryWorldCommunicator` and `simulationCommunicator` values. This is actually partly in support of the logic for, e.g., `findIsSimulationMasterRank`. This is confusing and should be documented. It may also improve readability to instead reference `TMPI_COMM_WORLD` here and in `findIsSimulationMasterRank()`.
- `gmx_multisim_t::~gmx_multisim_t()` relies on a comparison to MPI_COMM_WORLD to decide whether to call `MPI_Comm_free`. This could result in duplicated `MPI_Comm_free` calls. This should be mitigated by an RAII `MPI_Comm` handle (see the third sketch after this list). See also #3688 (closed).
- `gmx_check_thread_affinity_set()` -> `detectDefaultAffinityMask()` calls `MPI_Allreduce` on MPI_COMM_WORLD, which would hang if the full allocation is not used for libgromacs tasks following roughly the same initialization path. The parameterization shown in the second sketch after this list applies here as well.
- `gmx_node_num()` and `gmx_node_rank()` are wrappers for `MPI_Comm_size()` and `MPI_Comm_rank()` on MPI_COMM_WORLD. They seem to be misleadingly named. Maybe they should be factored away with the introduction of an RAII communicator holder, but they should at least be clarified and updated to require a communicator argument (see the last sketch after this list).
- `gmx_broadcast_world()` does what it says, substituting a no-op when `!GMX_MPI`. Presumably, it allows call sites to avoid knowledge of the `comm_rec`, so the use cases will need to be investigated.
- `gmx_abort()` explicitly wraps `MPI_Abort(MPI_COMM_WORLD, errorno)`. This is fine for programs, but I think there are cases where `gmx_abort()` can be called in library functions. This could be tricky to resolve in the long run. In the short term, we may have to accept that an entire job may get taken down by certain types of errors until further notice.
- `gmx_exit_on_fatal_error()` has an MPI_COMM_WORLD barrier, which I think is intended to block until the `MPI_Abort` occurs elsewhere. I think this is probably okay.
- `gmx::finalize()` has an `MPI_Barrier(MPI_COMM_WORLD)`, but checks `g_initializationCounter`, so it should be safe.
- Various uses of MPI_COMM_WORLD occur in blocks that are conditional on `#if GMX_THREAD_MPI`. These are harmless, but it would be more readable and future-proof to update them to reference `TMPI_COMM_WORLD` instead.
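For the `gmx_fatal_collective` item, a minimal sketch of the suggested `dup` mitigation, assuming the problematic check is a plain handle comparison against MPI_COMM_WORLD:

```cpp
#include <mpi.h>
#include <cassert>

// Gives the library a communicator equivalent to the client's world that
// nevertheless compares unequal to MPI_COMM_WORLD, so ownership checks like the
// one in gmx_fatal_collective will not conclude that libgromacs should finalize MPI.
MPI_Comm makeApiWorldComm()
{
    MPI_Comm apiComm = MPI_COMM_NULL;
    MPI_Comm_dup(MPI_COMM_WORLD, &apiComm);
    assert(apiComm != MPI_COMM_WORLD); // a dup is always a distinct handle
    return apiComm; // the client must MPI_Comm_free() this before MPI_Finalize()
}
```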
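For the hardware-detection reduction (and the analogous `MPI_Allreduce` in `detectDefaultAffinityMask()`), the remediation is to thread the caller's communicator through the call chain. The function name and signature below are illustrative, not the actual GROMACS ones:

```cpp
#include <mpi.h>

// Illustrative replacement for the hard-coded reduction: the communicator is a
// parameter instead of MPI_COMM_WORLD. With a partial allocation, ranks outside
// `comm` never reach hardware detection, so reducing on MPI_COMM_WORLD would
// block the participating ranks forever.
void reduceDetectionResults(int* counts, int numCounts, MPI_Comm comm)
{
    MPI_Allreduce(MPI_IN_PLACE, counts, numCounts, MPI_INT, MPI_MAX, comm);
}
```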
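A sketch of the RAII communicator holder proposed for `gmx_multisim_t` (and mentioned again for `gmx_node_num()`/`gmx_node_rank()`). This class is an illustration of the idea, not existing GROMACS code; it owns a `dup`ed communicator, so the destructor never needs to compare against MPI_COMM_WORLD and `MPI_Comm_free` runs exactly once:

```cpp
#include <mpi.h>
#include <utility>

// Owns a duplicated communicator and frees it exactly once.
class CommGuard
{
public:
    explicit CommGuard(MPI_Comm source) { MPI_Comm_dup(source, &comm_); }
    ~CommGuard()
    {
        if (comm_ != MPI_COMM_NULL)
        {
            MPI_Comm_free(&comm_);
        }
    }
    // Non-copyable: exactly one owner, exactly one MPI_Comm_free.
    CommGuard(const CommGuard&) = delete;
    CommGuard& operator=(const CommGuard&) = delete;
    CommGuard(CommGuard&& other) noexcept : comm_(std::exchange(other.comm_, MPI_COMM_NULL)) {}

    MPI_Comm get() const { return comm_; }
    int size() const { int n = 0; MPI_Comm_size(comm_, &n); return n; }
    int rank() const { int r = 0; MPI_Comm_rank(comm_, &r); return r; }

private:
    MPI_Comm comm_ = MPI_COMM_NULL;
};
```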
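Finally, if the `gmx_node_num()`/`gmx_node_rank()` wrappers are kept rather than factored away, requiring an explicit communicator could look like the following (hypothetical names and signatures):

```cpp
#include <mpi.h>

// Hypothetical clarified wrappers: same job as gmx_node_num()/gmx_node_rank(),
// but honestly named and with no baked-in MPI_COMM_WORLD.
int gmx_comm_size(MPI_Comm comm)
{
    int size = 0;
    MPI_Comm_size(comm, &size);
    return size;
}

int gmx_comm_rank(MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);
    return rank;
}
```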