SYCL: Add PackedFloat3 for AMD CDNA2 devices
AMD CDNA2 devices, such as the MI250, reach their full advertised FP32 performance only when operating on packed pairs of floats, i.e., as 2-wide SIMD. The compiler can apply this optimization automatically in most cases, but as of ROCm 5.7 it introduces extra inefficiencies along the way. Therefore, we use packed float2 explicitly in a few critical places.
This gives a 2-5% speed-up of the NBNXM kernels on MI250X.
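For readers unfamiliar with the idea, here is a minimal sketch of what a packed three-component float can look like: x and y share one 2-wide value so that a single 2-wide operation covers both, and z stays scalar. This is not the code from this merge request. `Float2`, `PackedFloat3Sketch`, and `accumulateExample` are hypothetical names, and `Float2` is a plain-C++ stand-in for a vector type such as `sycl::float2` so the example builds without a SYCL toolchain. Whether a compiler actually emits packed FP32 instructions for such code depends on the toolchain and target.

```cpp
#include <cassert>

// Hypothetical stand-in for a 2-wide vector type such as sycl::float2.
struct Float2
{
    float x;
    float y;
};

// Element-wise 2-wide operations; on suitable hardware each could map
// to a single packed instruction (an assumption, not a guarantee).
inline Float2 operator+(Float2 a, Float2 b)
{
    return { a.x + b.x, a.y + b.y };
}

inline Float2 operator*(Float2 a, Float2 b)
{
    return { a.x * b.x, a.y * b.y };
}

// Illustrative packed 3-vector: x and y live in one Float2, z is scalar.
class PackedFloat3Sketch
{
public:
    PackedFloat3Sketch(float x, float y, float z) : xy_{ x, y }, z_(z) {}

    float x() const { return xy_.x; }
    float y() const { return xy_.y; }
    float z() const { return z_; }

    // Accumulate another vector: one 2-wide add plus one scalar add.
    PackedFloat3Sketch& operator+=(const PackedFloat3Sketch& other)
    {
        xy_ = xy_ + other.xy_;
        z_ += other.z_;
        return *this;
    }

    // Scale by a scalar: one 2-wide multiply plus one scalar multiply.
    PackedFloat3Sketch operator*(float s) const
    {
        const Float2 scaled = xy_ * Float2{ s, s };
        return PackedFloat3Sketch(scaled.x, scaled.y, z_ * s);
    }

private:
    Float2 xy_;
    float  z_;
};

// Hypothetical usage: accumulate two scaled contributions into a buffer value.
inline PackedFloat3Sketch accumulateExample()
{
    PackedFloat3Sketch f(0.0F, 0.0F, 0.0F);
    f += PackedFloat3Sketch(1.0F, 2.0F, 3.0F) * 2.0F;
    f += PackedFloat3Sketch(0.5F, 0.5F, 0.5F) * 1.0F;
    return f;
}
```

Keeping z scalar means only two of the three components benefit from packing; the sketch makes no claim about how the actual class handles the third component.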
See also #4874
Fixes #4854 (closed)
Edited by Andrey Alekseenko
Activity
added SYCL label
- Resolved by Szilárd Páll
I also tested this with ROCm 5.3, 5.4, 5.5, 5.6, and 5.7; nothing much has changed (with respect to 84dc9ef9, which is the last version I tested). I suggest removing the draft tag.
- Resolved by Szilárd Páll
A couple of sentences in the commit message for the git history would be useful.
added 13 commits
- ddb78ba2...336eea4e - 5 commits from branch main
- 5da5e884 - Add FastFloat3 class, don't use it yet
- 40ff2ccb - Use FastFloat3 for fCiBuf
- 26f2e241 - Fix build
- d0b8b629 - Rename the class, add docs
- 6f3f634c - Silence compiler warnings
- 7c397454 - Add asserts and inlining attributes
- 99f08040 - Add a macro to toggle use of packed float3
- 3914b5dc - Move the struct to sycl_kernel_utils