Improve handling of GPU IDs when masking is used
Summary
Device masking by the runtime is widely used:
- LUMI uses `ROCR_VISIBLE_DEVICES` to set the device per rank in their scripts (see the illustration after this list). With AdaptiveCpp 23.10, this has better performance than letting GROMACS choose the GPU, due to runtime behavior (#4965).
- IntelMPI also restricts device visibility per rank when GPU-awareness is enabled.
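As a minimal illustration of the masking effect (CUDA shown here; `ROCR_VISIBLE_DEVICES` and `ONEAPI_DEVICE_SELECTOR` behave analogously), the runtime's device numbering restarts from 0 under a mask, while the PCI bus ID still identifies the physical device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("Visible devices: %d\n", count);
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The loop index i is the runtime's mask-dependent numbering;
        // the PCI bus ID identifies the physical device.
        std::printf("  runtime index %d -> PCI bus %02x (%s)\n", i, prop.pciBusID, prop.name);
    }
}
```

Run under `CUDA_VISIBLE_DEVICES=1`, this reports one visible device at runtime index 0, carrying the PCI bus ID of the second physical GPU.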
However, GROMACS in some places assumes that each rank sees all GPUs. This has at least the following effects:
- The hardware topology reported in the log is a lie: it says each node has only 1 GPU. That can confuse users.
- The task assignment reported in the log is a lie: it says all tasks are assigned to the same GPU. That can confuse users.
- The `mpi_comm_gpu_shared` communicator created in `dd_setup_dlb_resource_sharing` is intended to be a separate MPI communicator for each GPU, but with masking it considers all ranks on the node to be using the same GPU. Based on a brief look, this should not be critical, but it is not nice.
Possible solutions:
- When the env. var is set, report in the hardware log that the GPU info is not reliable.
- Get some kind of device UUID or PCIe address from the GPU runtime and count the number of different ones (see the sketch after this list).
- Query hardware info from HWLOC when linked.
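A sketch of the UUID-counting option, assuming CUDA and an intra-node MPI communicator; `countDistinctDevicesOnNode` and `DeviceUuid` are illustrative names, not existing GROMACS code:

```cpp
#include <algorithm>
#include <array>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <mpi.h>

using DeviceUuid = std::array<unsigned char, 16>;

// Count the distinct physical GPUs used by the ranks of nodeComm, even when
// masking makes every rank report its device as index 0.
int countDistinctDevicesOnNode(MPI_Comm nodeComm, int deviceIndexFromRuntime)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceIndexFromRuntime);
    DeviceUuid myUuid;
    std::memcpy(myUuid.data(), prop.uuid.bytes, myUuid.size());

    // Gather every rank's UUID on the node, then count the unique ones.
    int nodeSize = 0;
    MPI_Comm_size(nodeComm, &nodeSize);
    std::vector<DeviceUuid> uuids(nodeSize);
    MPI_Allgather(myUuid.data(), 16, MPI_UNSIGNED_CHAR,
                  uuids.data(), 16, MPI_UNSIGNED_CHAR, nodeComm);

    std::sort(uuids.begin(), uuids.end());
    return static_cast<int>(std::unique(uuids.begin(), uuids.end()) - uuids.begin());
}
```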
Observations by Mark:
The device ID aspects are not trivial. We use the set of device IDs for two things:
- indexing into the list of detected devices via `mdrun -gpu_id` and `mdrun -gputasks`
- determining whether ranks are sharing devices
For the former, we need a field of `DeviceInformation` that corresponds to the order in which the runtime presents the devices (given the environment at the time). Currently that is `id`, but perhaps it should be more like `deviceIndexFromRuntime`.

For the latter, we need a unique identifier for the piece of hardware, because the environment could be presenting the same device to different ranks on the same node with the same or different `deviceIndexFromRuntime`. (One could presumably do this with `CUDA_VISIBLE_DEVICES` or the ROCm equivalent; one can definitely do this with `ONEAPI_DEVICE_SELECTOR` or IMPI variables.) If the runtime provides a way to get UUIDs, then we should do that, store it as `uint128_t`, and use that when we need to know about physical device sharing (e.g. DLB). That's `struct cudaDeviceProp::uuid`, `cl_khr_device_uuid`, `hipDeviceGetUUID()`, and `sycl::device.get_info<sycl::ext::intel::info::device::uuid>()` for the respective GPU runtimes.
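A sketch of what that retrieval could look like for three of the runtimes (the OpenCL `cl_khr_device_uuid` path is omitted); `getDeviceUuid` and the `WITH_*` guards are illustrative names, not actual GROMACS code, and a byte array stands in for the `uint128_t`:

```cpp
#include <array>
#include <cstring>

// 128-bit identifier of the physical device.
using DeviceUuid = std::array<unsigned char, 16>;

#if defined(WITH_CUDA) // illustrative guard, not a real GROMACS macro
#    include <cuda_runtime.h>
DeviceUuid getDeviceUuid(int deviceIndexFromRuntime)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceIndexFromRuntime);
    DeviceUuid uuid;
    std::memcpy(uuid.data(), prop.uuid.bytes, uuid.size());
    return uuid;
}
#elif defined(WITH_HIP)
#    include <hip/hip_runtime.h>
DeviceUuid getDeviceUuid(int deviceIndexFromRuntime)
{
    hipDevice_t device;
    hipDeviceGet(&device, deviceIndexFromRuntime);
    hipUUID hipUuid;
    hipDeviceGetUUID(&hipUuid, device);
    DeviceUuid uuid;
    std::memcpy(uuid.data(), hipUuid.bytes, uuid.size());
    return uuid;
}
#elif defined(WITH_SYCL)
#    include <sycl/sycl.hpp>
DeviceUuid getDeviceUuid(const sycl::device& device)
{
    // The Intel extension returns std::array<unsigned char, 16> directly.
    return device.get_info<sycl::ext::intel::info::device::uuid>();
}
#endif
```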
Exact steps to reproduce
LUMI-G: Example log from a full-node run, https://github.com/Lumi-supercomputer/gromacs-on-lumi-workshop/blob/main/Exercise-3.3/STMV/ex3.3_8x7_jID5774166.log:

```
Running on 1 node with total 56 cores, 112 processing units, 1 compatible GPU
On host nid007959 1 GPU selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PME:0
```
IntelMPI:

```
intel-gpu04:~$ module load intel-oneapi/2024.0.1
Loading intel-oneapi/2024.0.1 with path /opt/tcbsys/intel-oneapi/2024.0.1
intel-gpu04:~$ mpirun -np 2 sycl-ls # No GPU-aware MPI, each rank sees both devices
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
intel-gpu04:~$ I_MPI_OFFLOAD=2 mpirun -np 2 sycl-ls # With GPU-aware MPI, each rank sees one device
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
```
If this is a bug, (1) what happens, and (2) what did you expect to happen?
- The MD log reports that we have one GPU per node when we have more. It should correctly report the number of devices or at least disclose that the info is unreliable.
- The DLB code thinks that ranks share a GPU when they do not. The MPI communicator that is intended to contain ranks sharing a GPU should only contain ranks that actually share one (see the sketch below).
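A sketch of how such a communicator could be built from the per-rank UUIDs, assuming an intra-node communicator; `makeGpuSharedComm` is an illustrative name, not the actual `dd_setup_dlb_resource_sharing` implementation:

```cpp
#include <array>
#include <vector>
#include <mpi.h>

using DeviceUuid = std::array<unsigned char, 16>; // as retrieved above

// Split nodeComm so that each sub-communicator contains exactly the ranks
// whose UUIDs match, i.e. the ranks actually sharing a physical GPU.
MPI_Comm makeGpuSharedComm(MPI_Comm nodeComm, const DeviceUuid& myUuid)
{
    int nodeSize = 0, nodeRank = 0;
    MPI_Comm_size(nodeComm, &nodeSize);
    MPI_Comm_rank(nodeComm, &nodeRank);

    // Gather every rank's UUID; use the lowest rank with a matching UUID as
    // the color, so identical devices map to identical colors.
    std::vector<DeviceUuid> uuids(nodeSize);
    MPI_Allgather(myUuid.data(), 16, MPI_UNSIGNED_CHAR,
                  uuids.data(), 16, MPI_UNSIGNED_CHAR, nodeComm);
    int color = 0;
    while (uuids[color] != myUuid)
    {
        color++;
    }

    MPI_Comm gpuSharedComm;
    MPI_Comm_split(nodeComm, color, nodeRank, &gpuSharedComm);
    return gpuSharedComm;
}
```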
For %2025.devcycle3, we should at least add a disclaimer about unreliable info, and perhaps backport it to 2024 too.