Building for AMD GPUs (ROCm)
Collecting some thoughts here about support for AMD GPUs.
In terms of our current design proposal, we need to spell out our assumptions and check whether they are reasonable:
- What is the compatibility between devices? Do newer GPUs support the execution of applications generated for previous versions? How can we ensure a decent fallback rather than having to provide builds for every possible target?
- https://salsa.debian.org/rocm-team/community/team-project/-/wikis/supported-gpu-list seems like the best information here
- It also explains how to set instruction override environment variables to ensure compatibility. v9 seems very messy: there doesn't seem to be a natural ordering of support there (90c seems unlikely to be compatible with 90a).
- Where can we find the capabilities supported by each version of the ROCm stack?
- For official support, look at https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus
- For unofficially supported GPUs, you may need to look at https://salsa.debian.org/rocm-team/community/team-project/-/wikis/supported-gpu-list
- How can we know what combinations are most used "in the wild"?
- From the existing EasyBuild ROCm PR it seems that the device drivers are an integral part of a ROCm installation. Is that a good idea? What is required on the system side to make things work? I see https://github.com/easybuilders/easybuild-easyconfigs/pull/18277/files#r1334391648, but perhaps there is more.
- From https://github.com/ROCm/ROCm/issues/1714#issuecomment-1128920548:
The ROCm stack is composed of software broadly split into categories consisting of kernel module (or driver), runtime, compiler, libraries and AI.
Now in the case of EasyBuild, we probably want everything up to and including the compilers to form a compiler toolchain, and then have separate modules for the further libraries and AI components (which we may or may not bundle?).
- How do I check what GPU is on the host?
- https://salsa.debian.org/rocm-team/community/team-project/-/wikis/supported-gpu-list#identifying-hardware seems like the best information here
- I also see that https://github.com/ROCm/rocm_smi_lib is deprecated in favour of https://github.com/ROCm/amdsmi. Are there more examples of deprecation? What is current best practice (assuming we will just start without any baggage)?
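
To make the host-detection and override questions above a bit more concrete, here is a minimal Python sketch. It assumes that `rocminfo` is on the `PATH` and prints ISA names such as `gfx90a` or `gfx1030`, and that the "instruction override" variable mentioned above is `HSA_OVERRIDE_GFX_VERSION`, which as far as I know takes a `major.minor.stepping` value where the last two characters of the gfx name are hex digits. The function names are hypothetical and none of this has been checked against a real EasyBuild setup.

```python
import re
import shutil
import subprocess

# Assumption: ISA names look like "gfx" followed by 3 or 4 hex characters,
# e.g. gfx906, gfx90a, gfx1030.
_GFX_RE = re.compile(r"\bgfx[0-9a-f]{3,4}\b")


def parse_gfx_targets(rocminfo_output):
    """Return the distinct gfx names found in rocminfo-style text, in order."""
    seen = []
    for name in _GFX_RE.findall(rocminfo_output):
        if name not in seen:
            seen.append(name)
    return seen


def override_value(gfx_name):
    """Convert a gfx name to a major.minor.stepping string.

    Assumption: the last two characters are single hex digits (minor and
    stepping) and everything before them is the decimal major version.
    """
    digits = gfx_name[len("gfx"):]
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


def detect_host_targets():
    """Run rocminfo if it is available; return an empty list otherwise."""
    if shutil.which("rocminfo") is None:
        return []
    result = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=False
    )
    return parse_gfx_targets(result.stdout)


if __name__ == "__main__":
    for target in detect_host_targets():
        print(target, "->", override_value(target))
```

Something like this would only tell us which targets a host reports. Whether an override is actually safe for a given pair of devices is exactly the open compatibility question above, so any mapping from a host target to a fallback build would have to come from the supported-gpu-list page rather than from the name alone.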