ROCm runtime overhead from unused queues / queue multiplexing performance weirdness
Summary
When running GROMACS with AdaptiveCpp (hipSYCL/ACpp) on a multi-GPU AMD node, HIP streams are created by each GROMACS process on all visible devices. Unless `ROCR_VISIBLE_DEVICES` is set, each process therefore allocates some resources on every GPU. While this should not cause any problems in theory, in practice, with ROCm 5.3.3, there are significant performance overheads.
Data from @sebastian2802 on LUMI, running 8 single-GPU simulations (the two columns are ns/day and hour/ns):
Using `-gpu_id` (not hiding any devices):
Performance: 180.099 0.133
Performance: 179.146 0.134
Performance: 177.641 0.135
Performance: 176.538 0.136
Performance: 176.467 0.136
Performance: 144.293 0.166
Performance: 137.392 0.175
Performance: 136.465 0.176
Using `ROCR_VISIBLE_DEVICES`:
Performance: 214.939 0.112
Performance: 214.315 0.112
Performance: 214.179 0.112
Performance: 213.678 0.112
Performance: 213.608 0.112
Performance: 213.635 0.112
Performance: 213.022 0.113
Performance: 212.860 0.113
The above applies to LUMI/Dardel with ROCm 5.3.3. Newer ROCm versions might behave differently. Single-GPU systems are, by definition, not affected, and on systems with fewer GPUs the effect should be less pronounced.
While the problem was exacerbated by the behavior of AdaptiveCpp 23.10 (https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1246), this is likely to affect native HIP too.
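For context, a minimal sketch of the device enumeration each process performs on start-up, assuming a SYCL 2020 implementation such as AdaptiveCpp (illustrative only, not GROMACS code):

```cpp
// Minimal sketch, assuming a SYCL 2020 implementation (AdaptiveCpp over HIP);
// illustrative code, not taken from GROMACS.
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    // Without ROCR_VISIBLE_DEVICES, every GPU the ROCm runtime exposes is
    // enumerated here, so each process ends up touching all of them.
    // With, e.g., ROCR_VISIBLE_DEVICES=4, only that single device is visible.
    for (const auto& dev : sycl::device::get_devices(sycl::info::device_type::gpu))
    {
        std::cout << dev.get_info<sycl::info::device::name>() << '\n';
    }
    return 0;
}
```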
Exact steps to reproduce
Run the script below on a LUMI/Dardel node with ROCm 5.3.3 and AdaptiveCpp 23.10 or hipSYCL 0.9.4. Then replace `ROCR_VISIBLE_DEVICES=${GPUID[$((i-1))]}` with the mdrun option `-gpu_id ${GPUID[$((i-1))]}`. Observe that performance becomes lower and fluctuates more.
GPUID=(4 5 2 3 6 7 0 1)  # GPU (GCD) index used by each simulation
# PINOFFSET is not defined in the original snippet; the values below are an
# assumed example consistent with the 7-core hwloc-bind ranges used below.
PINOFFSET=(0 14 28 42 56 70 84 98)
for i in $(seq -w 1 8); do
ROCR_VISIBLE_DEVICES=${GPUID[$((i-1))]} HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 \
hwloc-bind --cpubind "core:$(((${i}-1)*7))-$(((${i}-1)*7 + 6))" \
gmx_mpi mdrun -pin on -pinoffset ${PINOFFSET[$((i-1))]} -pinstride 2 \
-nb gpu -bonded gpu -pme gpu -update gpu -ntomp 4 \
-g $i.log -nobackup -noconfout -nsteps 10000 -notunepme -resethway &> $i.out &
done
wait
Possible fixes
There are several places where the extra resources are allocated:
- AdaptiveCpp 23.10 and earlier allocates, in the background, 4 streams on each visible device for its out-of-order executor, which GROMACS does not use. This is fixed in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1318 by not initializing `multi_queue_executor` unless it is requested (i.e., an out-of-order `sycl::queue` is created). So AdaptiveCpp 2024.x does not have this problem.
- GROMACS creates an out-of-order queue during device initialization, prompting the initialization of ACpp's `multi_queue_executor` (although it is later destroyed, since the SYCL runtime is de-initialized after the device-detection stage). Fixed by !4032 (merged). See the first sketch after this list for the two queue flavors involved.
- With both points above fixed, we do not have any lingering queues on unused devices. Still, it turns out that even if the queues are destroyed, the device is "tainted" and performance suffers. So our device-detection loop iterating over all devices still affects performance, even if we avoid all of SYCL's multi-queue machinery. This appears to be an issue with the ROCm runtime, and it might have been fixed in newer versions. Meanwhile, we can try to avoid this loop and do "lightweight" device detection, without running the test kernel, when multiple AMD devices are detected (see the second sketch below).
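To make the first two points concrete, here is a minimal sketch in plain SYCL 2020 syntax (illustrative only, not the actual GROMACS or AdaptiveCpp code) of the two queue flavors:

```cpp
// Sketch only: shows which queue flavor involves ACpp's multi_queue_executor.
#include <sycl/sycl.hpp>

int main()
{
    sycl::device dev{ sycl::gpu_selector_v };

    // In-order queue: maps onto a single HIP stream. This is what GROMACS uses
    // for its actual work; it does not need the out-of-order executor.
    sycl::queue inOrder{ dev, sycl::property::queue::in_order{} };

    // Default (out-of-order) queue: in ACpp 23.10 and earlier, the
    // multi_queue_executor (4 background streams per visible device) was
    // initialized eagerly regardless; after AdaptiveCpp PR 1318 it is only
    // initialized when a queue like this one is actually requested. GROMACS
    // used to create such a queue during device detection; !4032 removed that.
    sycl::queue outOfOrder{ dev };

    return 0;
}
```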
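And a sketch of what the "lightweight" detection mentioned in the last point could look like; `DeviceRecord` and `detectDevices` are hypothetical names, not the GROMACS implementation:

```cpp
// Hypothetical sketch of "lightweight" vs. full device detection; the names
// here are illustrative, not GROMACS API.
#include <sycl/sycl.hpp>
#include <string>
#include <vector>

struct DeviceRecord
{
    std::string name;
    bool        kernelTested; // true if a test kernel was actually run
};

std::vector<DeviceRecord> detectDevices(bool lightweight)
{
    std::vector<DeviceRecord> records;
    for (const auto& dev : sycl::device::get_devices(sycl::info::device_type::gpu))
    {
        DeviceRecord rec{ dev.get_info<sycl::info::device::name>(), false };
        if (!lightweight)
        {
            // Full check: create a queue and run a dummy kernel. With ROCm
            // 5.3.3 this appears to leave the device "tainted" even after the
            // queue is destroyed, and performance suffers (see the Summary).
            sycl::queue q{ dev, sycl::property::queue::in_order{} };
            q.single_task([] {}).wait();
            rec.kernelTested = true;
        }
        records.push_back(rec);
    }
    return records;
}
```

The idea would be to take the lightweight path when multiple AMD devices are visible, so that processes never touch GPUs they will not use.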
There is also always the workaround of using `ROCR_VISIBLE_DEVICES`. The main downside is that it affects hardware reporting and also dissuades users from using GROMACS's own `-gpu_id` mechanism. But that is how, e.g., LUMI recommends running GPU code.