Fix DLB with IMPI+PVC using device UUID
Currently GROMACS finds all available devices, and record in DeviceInformation::id
the index corresponding to the order in which the runtime provided it to GROMACS. gmx_hwinfo_t::deviceInfoList
is a vector of handles to DeviceInformation
objects, the order of which happens to correspond to the runtime order, so the index of an object within gmx_hwinfo_t::deviceInfoList
happens to be the same as the value of the id
field of that object.
In some places we use the integer index of such an object as a "device ID" where we should probably use a DeviceInformation*
directly, given that gmx:hwinfo_t::deviceInfoList
is effectively a constant. That is a bit harder to misuse and avoids having to pass around the whole gmx_hwinfo_t
object as well.
The gmx_hwinfo_t::id
field is used for two things:
- indexing into via
mdrun -gpu_id
andmdrun -gputasks
to see if the user's IDs correspond to compatible devices - determining whether ranks are sharing devices to help DLB do better
We tacitly assume that the user understands the impact of any per-rank environment variables (e.g. CUDA_VISIBLE_DEVICES
or the ROCm equivalent, or ONEAPI_DEVICE_SELECTOR
) that might affect which devices are visible with which indices. We could look for those variables while handling mdrun -gpu_id
and mdrun -gputasks
and warn of possible unexpected consequences.
If such variables are in use (which is routine when using IMPI on L0 and presumably later IMPI on CUDA), then typically each rank sees only a single device whose DeviceInformation::id
will be 0. This is used to split an MPI communicator for DLB, but that is ineffective because all ranks would end up in the same DLB communicator.
Formally, the behaviour of (at least) DLB+IMPI+PVC is buggy in release-2024
, but I've not yet made a fixed version that can see how big the consequences are. We'd need that before we decide whether to fix it in release-2024
.
These thoughts arose following discussion from !4223 (merged) and should be addressed
-
@al42and started a discussion: (+10 comments)
Proposal
Instead of splitting on DeviceInformation::id
we should detect and use a device UUID to split that communicator. The GPU runtimes provide respectively struct cudaDeviceProp::uuid
, cl_khr_device_uuid
, hipDeviceGetUUID()
, and sycl::device.get_info<sycl::ext::intel::info::device::uuid>()
to provide the 32 8-bit fields of a UUID, which we can hash to an integer to use with MPI_Comm_split
.
We still need a field of DeviceInformation
that corresponds to the order in which the runtime presents the devices. Currently that is id
but perhaps should be more like deviceIndexFromRuntime
.