SYCL: Add PackedFloat3 for AMD CDNA2 devices
AMD CDNA2 devices, such as the MI250, achieve their full advertised performance only when operating on packed FP32 values, two at a time in a 2-wide SIMD fashion. The compiler can apply this optimization automatically in most cases, but as of ROCm 5.7 it introduces extra inefficiencies along the way. Therefore, we use packed float2 explicitly in a few critical places.

This gives a 2-5% speed-up of NBNXM kernels on MI250X.

See also #4874
Fixes #4854
Closes #4854
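To illustrate the idea of pairing two of the three vector components so they can be processed by one 2-wide operation, here is a minimal, host-only C++ sketch. It is not the code from this change: `Float2`, `PackedFloat3Sketch`, and `dot` are hypothetical names chosen for illustration, and the plain `Float2` struct only stands in for a native packed type (in SYCL code, `sycl::float2` would be a natural candidate). Whether the compiler actually emits packed FP32 instructions for such code depends on the toolchain and target.

```cpp
// Hypothetical 2-wide float pair; a stand-in for a native packed type.
struct Float2
{
    float x, y;
};

// Element-wise operations on the pair. On hardware with packed FP32
// support, each of these could map to a single 2-wide instruction.
inline Float2 operator+(Float2 a, Float2 b)
{
    return { a.x + b.x, a.y + b.y };
}

inline Float2 operator*(Float2 a, Float2 b)
{
    return { a.x * b.x, a.y * b.y };
}

// Illustrative 3-vector that keeps x and y together in one pair and
// stores z as a separate scalar (an assumption about the layout).
struct PackedFloat3Sketch
{
    Float2 xy;
    float  z;

    PackedFloat3Sketch(float x, float y, float zValue) : xy{ x, y }, z(zValue) {}

    // x and y are updated by one pair-wise add, z by a scalar add.
    PackedFloat3Sketch& operator+=(const PackedFloat3Sketch& other)
    {
        xy = xy + other.xy;
        z += other.z;
        return *this;
    }

    // Scaling broadcasts the scalar into a pair for the x/y part.
    PackedFloat3Sketch operator*(float s) const
    {
        const Float2 scaled = xy * Float2{ s, s };
        return PackedFloat3Sketch(scaled.x, scaled.y, z * s);
    }
};

// Dot product: one pair-wise multiply for x/y, one scalar multiply for z.
inline float dot(const PackedFloat3Sketch& a, const PackedFloat3Sketch& b)
{
    const Float2 p = a.xy * b.xy;
    return p.x + p.y + a.z * b.z;
}
```

In this sketch, accumulating a force contribution would look like `f += dr * scalarForce;`, with the x and y components handled together and z handled on its own.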
- mentioned in issue #4888 (closed)
- mentioned in merge request !3865 (merged)
- mentioned in commit ad4f2655
- mentioned in commit 40961962
- mentioned in commit cb23de8f