Commit 56e7168c authored by Andrey Alekseenko, committed by Szilárd Páll

SYCL: Add PackedFloat3 for AMD CDNA2 devices

AMD CDNA2 devices, such as MI250, achieve full advertised performance
only when operating on packed FP32 floats, in a 2-wide SIMD way.
The compiler can apply this optimization automatically in most cases,
but as of ROCm 5.7, it introduces extra inefficiencies along the way.
Therefore, we use packed float2 explicitly in a few critical places.
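As an illustration of the idea only, and not the code added by this commit, a three-component vector can keep x and y together in a two-lane value and z as a separate scalar, so that arithmetic on the x/y pair is written as a single two-wide operation. The sketch below is plain C++; the names Float2Sketch, PackedFloat3Sketch and dotProduct are hypothetical, and whether a given compiler turns the pair operations into packed FP32 instructions is an assumption that would need to be checked in the generated code.

```cpp
#include <cassert>

// Hypothetical two-lane value standing in for a native float2 vector type.
struct Float2Sketch
{
    float x, y;
};

// Lane-wise operations on the pair; each is written as one two-wide step.
inline Float2Sketch operator+(Float2Sketch a, Float2Sketch b)
{
    return { a.x + b.x, a.y + b.y };
}
inline Float2Sketch operator*(Float2Sketch a, Float2Sketch b)
{
    return { a.x * b.x, a.y * b.y };
}

// Hypothetical packed 3-vector: x and y share one pair, z stays scalar.
struct PackedFloat3Sketch
{
    Float2Sketch xy;
    float        z;
};

inline PackedFloat3Sketch operator+(PackedFloat3Sketch a, PackedFloat3Sketch b)
{
    return { a.xy + b.xy, a.z + b.z };
}

// Scaling broadcasts the scalar into both lanes of the pair.
inline PackedFloat3Sketch operator*(PackedFloat3Sketch a, float s)
{
    return { a.xy * Float2Sketch{ s, s }, a.z * s };
}

// Dot product: one pair-wise multiply, then a horizontal sum plus the z term.
inline float dotProduct(PackedFloat3Sketch a, PackedFloat3Sketch b)
{
    const Float2Sketch p = a.xy * b.xy;
    return p.x + p.y + a.z * b.z;
}

// Small usage example with values that are exact in single precision.
inline void exampleUsage()
{
    const PackedFloat3Sketch a{ { 1.0F, 2.0F }, 3.0F };
    const PackedFloat3Sketch b{ { 4.0F, 5.0F }, 6.0F };
    const PackedFloat3Sketch sum = a + b;
    assert(sum.xy.x == 5.0F && sum.xy.y == 7.0F && sum.z == 9.0F);
    assert(dotProduct(a, b) == 32.0F);
}
```

Keeping z outside the pair means only two of the three components benefit from the two-wide form; the layout is a trade-off chosen here for simplicity of the sketch.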

This gives a 2-5% speed-up of the NBNXM kernels on MI250X.

See also #4874

Fixes #4854

Closes #4854
parent 336eea4e