Improve handling of GPU IDs when masking is used
Summary
Device masking by the runtime is widely used:
- LUMI uses `ROCR_VISIBLE_DEVICES` to set the device per rank in their scripts (see the illustration after this list). With AdaptiveCpp 23.10, this has better performance than letting GROMACS choose the GPU, due to runtime behavior (#4965).
- IntelMPI also restricts device visibility per rank when GPU-awareness is enabled.
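As a minimal illustration of the masking effect (CUDA shown here; `ROCR_VISIBLE_DEVICES` and `ONEAPI_DEVICE_SELECTOR` behave analogously), the runtime's device numbering restarts from 0 under a mask, while the PCI bus ID still identifies the physical device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("Visible devices: %d\n", count);
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The loop index i is the runtime's mask-dependent numbering;
        // the PCI bus ID identifies the physical device.
        std::printf("  runtime index %d -> PCI bus %02x (%s)\n", i, prop.pciBusID, prop.name);
    }
}
```

Run under `CUDA_VISIBLE_DEVICES=1`, this reports one visible device at runtime index 0, carrying the PCI bus ID of the second physical GPU.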
However, GROMACS in some places assumes that each rank sees all GPUs. This has at least the following effects:
- The hardware topology reported in the log is a lie: it says each node has only 1 GPU. That can confuse users.
- The task assignment reported in the log is a lie: it says all tasks are assigned to the same GPU. That can confuse users.
- The `mpi_comm_gpu_shared` communicator created in `dd_setup_dlb_resource_sharing` is intended to be a separate MPI communicator for each GPU, but with masking it considers all ranks on the node to be using the same GPU. Based on a brief look, this should not be critical, but it is not nice.
Possible solutions:
- When the env. var is set, report in the hardware log that the GPU info is not reliable.
- Get some kind of device UUID or PCIe address from the GPU runtime and count the number of different ones (see the sketch after this list).
- Query hardware info from HWLOC when linked.
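A sketch of the UUID-counting option, assuming CUDA and an intra-node MPI communicator; `countDistinctDevicesOnNode` and `DeviceUuid` are illustrative names, not existing GROMACS code:

```cpp
#include <algorithm>
#include <array>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <mpi.h>

using DeviceUuid = std::array<unsigned char, 16>;

// Count the distinct physical GPUs used by the ranks of nodeComm, even when
// masking makes every rank report its device as index 0.
int countDistinctDevicesOnNode(MPI_Comm nodeComm, int deviceIndexFromRuntime)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceIndexFromRuntime);
    DeviceUuid myUuid;
    std::memcpy(myUuid.data(), prop.uuid.bytes, myUuid.size());

    // Gather every rank's UUID on the node, then count the unique ones.
    int nodeSize = 0;
    MPI_Comm_size(nodeComm, &nodeSize);
    std::vector<DeviceUuid> uuids(nodeSize);
    MPI_Allgather(myUuid.data(), 16, MPI_UNSIGNED_CHAR,
                  uuids.data(), 16, MPI_UNSIGNED_CHAR, nodeComm);

    std::sort(uuids.begin(), uuids.end());
    return static_cast<int>(std::unique(uuids.begin(), uuids.end()) - uuids.begin());
}
```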
Observations by Mark:
The device ID aspects are not trivial. We use the set of device IDs for two things:
- indexing into the list of detected devices via `mdrun -gpu_id` and `mdrun -gputasks`
- determining whether ranks are sharing devices
For the former, we need a field of `DeviceInformation` that corresponds to the order in which the runtime presents the devices (given the environment at the time). Currently that is `id`, but perhaps it should be more like `deviceIndexFromRuntime`.

For the latter, we need a unique identifier for the piece of hardware, because the environment could be presenting the same device to different ranks on the same node with the same or different `deviceIndexFromRuntime`. (One could presumably do this with `CUDA_VISIBLE_DEVICES` or the ROCm equivalent; one can definitely do this with `ONEAPI_DEVICE_SELECTOR` or IMPI variables.) If the runtime provides a way to get UUIDs, then we should do that, store it as `uint128_t`, and use that when we need to know about physical device sharing (e.g. DLB). That's `struct cudaDeviceProp::uuid`, `cl_khr_device_uuid`, `hipDeviceGetUUID()`, and `sycl::device.get_info<sycl::ext::intel::info::device::uuid>()` for the respective GPU runtimes.
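A sketch of what that retrieval could look like for three of the runtimes (the OpenCL `cl_khr_device_uuid` path is omitted); `getDeviceUuid` and the `WITH_*` guards are illustrative names, not actual GROMACS code, and a byte array stands in for the `uint128_t`:

```cpp
#include <array>
#include <cstring>

// 128-bit identifier of the physical device.
using DeviceUuid = std::array<unsigned char, 16>;

#if defined(WITH_CUDA) // illustrative guard, not a real GROMACS macro
#    include <cuda_runtime.h>
DeviceUuid getDeviceUuid(int deviceIndexFromRuntime)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceIndexFromRuntime);
    DeviceUuid uuid;
    std::memcpy(uuid.data(), prop.uuid.bytes, uuid.size());
    return uuid;
}
#elif defined(WITH_HIP)
#    include <hip/hip_runtime.h>
DeviceUuid getDeviceUuid(int deviceIndexFromRuntime)
{
    hipDevice_t device;
    hipDeviceGet(&device, deviceIndexFromRuntime);
    hipUUID hipUuid;
    hipDeviceGetUUID(&hipUuid, device);
    DeviceUuid uuid;
    std::memcpy(uuid.data(), hipUuid.bytes, uuid.size());
    return uuid;
}
#elif defined(WITH_SYCL)
#    include <sycl/sycl.hpp>
DeviceUuid getDeviceUuid(const sycl::device& device)
{
    // The Intel extension returns std::array<unsigned char, 16> directly.
    return device.get_info<sycl::ext::intel::info::device::uuid>();
}
#endif
```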
Exact steps to reproduce
LUMI-G: Example log from a full-node run, https://github.com/Lumi-supercomputer/gromacs-on-lumi-workshop/blob/main/Exercise-3.3/STMV/ex3.3_8x7_jID5774166.log:

```
Running on 1 node with total 56 cores, 112 processing units, 1 compatible GPU
On host nid007959 1 GPU selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PME:0
```
IntelMPI:

```
intel-gpu04:~$ module load intel-oneapi/2024.0.1
Loading intel-oneapi/2024.0.1 with path /opt/tcbsys/intel-oneapi/2024.0.1
intel-gpu04:~$ mpirun -np 2 sycl-ls # No GPU-aware MPI, each rank sees both devices
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
intel-gpu04:~$ I_MPI_OFFLOAD=2 mpirun -np 2 sycl-ls # With GPU-aware MPI, each rank sees one device
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26690]
```
If this is a bug, (1) what happens, and (2) what did you expect to happen?
- The MD log reports that we have one GPU per node when we have more. It should correctly report the number of devices or at least disclose that the info is unreliable.
- The DLB code thinks that ranks share a GPU when they do not. The MPI communicator that is intended to contain ranks sharing a GPU should only contain ranks that actually share one (see the sketch below).
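A sketch of how such a communicator could be built from the per-rank UUIDs, assuming an intra-node communicator; `makeGpuSharedComm` is an illustrative name, not the actual `dd_setup_dlb_resource_sharing` implementation:

```cpp
#include <array>
#include <vector>
#include <mpi.h>

using DeviceUuid = std::array<unsigned char, 16>; // as retrieved above

// Split nodeComm so that each sub-communicator contains exactly the ranks
// whose UUIDs match, i.e. the ranks actually sharing a physical GPU.
MPI_Comm makeGpuSharedComm(MPI_Comm nodeComm, const DeviceUuid& myUuid)
{
    int nodeSize = 0, nodeRank = 0;
    MPI_Comm_size(nodeComm, &nodeSize);
    MPI_Comm_rank(nodeComm, &nodeRank);

    // Gather every rank's UUID; use the lowest rank with a matching UUID as
    // the color, so identical devices map to identical colors.
    std::vector<DeviceUuid> uuids(nodeSize);
    MPI_Allgather(myUuid.data(), 16, MPI_UNSIGNED_CHAR,
                  uuids.data(), 16, MPI_UNSIGNED_CHAR, nodeComm);
    int color = 0;
    while (uuids[color] != myUuid)
    {
        color++;
    }

    MPI_Comm gpuSharedComm;
    MPI_Comm_split(nodeComm, color, nodeRank, &gpuSharedComm);
    return gpuSharedComm;
}
```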
For %2025.devcycle3, we should at least add a disclaimer about unreliable info, and perhaps backport it to 2024 too.