Atomics with hipSYCL and AMD
The current generation of AMD hardware does not fully support floating-point atomics in global memory.
Specifically:
- MI50 (gfx906): no support at all.
- MI100 (gfx908): hardware "no return" op, which does addition, but is not guaranteed to return the old value.
- MI200 (gfx90a): hardware "unsafe" op, which might not be fully IEEE-compliant.
By default, an atomicAdd(float*, float) call on global memory gets compiled into a CAS loop, which is slow, especially for the NBNXM FV and PME Spread/Gather kernels.
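To illustrate what such a CAS-loop fallback does, here is a minimal host-side C++ sketch built on std::atomic<float>. It is not the device code the compiler actually emits, and the function name casLoopAtomicAdd is hypothetical. It only shows the retry pattern: read the current value, compute the sum, and try to swap it in, repeating whenever another thread has changed the value in the meantime.

```cpp
#include <atomic>

// Hypothetical helper for illustration only: emulates a floating-point
// atomic add with a compare-and-swap (CAS) loop.
// Returns the value that was stored before the addition.
float casLoopAtomicAdd(std::atomic<float>& target, float delta)
{
    // Initial guess for the currently stored value.
    float expected = target.load(std::memory_order_relaxed);
    // If another thread modified `target` since `expected` was read,
    // compare_exchange_weak fails and writes the fresh value into
    // `expected`; the sum is then recomputed and the swap retried.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed))
    {
    }
    return expected;
}
```

Under contention, every failed attempt costs an extra round trip to memory, which is likely why kernels where many threads accumulate into the same addresses suffer most from this fallback.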
Workarounds:
- The -munsafe-fp-atomics compiler switch should force the compiler to use fast versions whenever possible.
  - In our experiments, ROCm 4.5.2 fails to do so in GROMACS 2022-hwe / hipSYCL 0.9.2. For simple code snippets, the flag works.
  - Might have other performance benefits.
- Manually call the "fast" function.
  - Works with ROCm 4.5.2, hipSYCL 0.9.2, MI100.
  - Not tested with MI200.
Relevant links:
- https://github.com/ROCm-Developer-Tools/hipamd/issues/19: discussion of atomicAddNoRet.
- https://github.com/illuhad/hipSYCL/issues/729: slightly related issue about automatic optimization in hipSYCL.
Tasks left to do:
- !2643 (closed) (tested on MI100).
- Test on MI200.
Edited by Andrey Alekseenko