SYCL: Avoid performance regression with ROCm 5.5 on MI250X (part 2)
AMD CDNA2 devices, such as MI250, reach their full advertised FP32 throughput only when operating on packed FP32 values, that is, on two floats at a time in a 2-wide SIMD fashion.
As noted in #4874, this explicit use of packing is necessary with ROCm 5.5+ to mitigate the performance regression of the NBNXM LJ Force Switch kernels relative to ROCm 5.3 and earlier.
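For readers unfamiliar with the idea, here is a minimal host-side sketch of what "operating on packed FP32" means. It is not the code from this change: `Float2`, `fma2` and `evalQuadratic2` are hypothetical names invented for illustration, and the polynomial is a stand-in rather than the actual force-switch expression. In SYCL device code a 2-wide vector type such as `sycl::float2` would presumably play the role of `Float2`, and the expectation (an assumption, not verified here) is that arithmetic written this way lets the compiler select the packed FP32 instructions on CDNA2.

```cpp
#include <cmath>

// Hypothetical 2-wide float type for illustration only;
// not a GROMACS or SYCL type.
struct Float2
{
    float x;
    float y;
};

// Elementwise fused multiply-add: computes a * b + c in both lanes.
inline Float2 fma2(Float2 a, Float2 b, Float2 c)
{
    return { std::fma(a.x, b.x, c.x), std::fma(a.y, b.y, c.y) };
}

// Evaluate c0 + c1*t + c2*t^2 for two inputs at once using Horner's scheme.
// Every arithmetic step acts on a pair of floats, which is the pattern
// that packed FP32 hardware instructions are designed for.
inline Float2 evalQuadratic2(Float2 t, float c0, float c1, float c2)
{
    Float2 acc = { c2, c2 };
    acc        = fma2(acc, t, { c1, c1 });
    acc        = fma2(acc, t, { c0, c0 });
    return acc;
}
```

The point of the sketch is only that the two lanes are kept together through the whole computation instead of being processed by two independent scalar instruction streams.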
This is commit 56e7168c (!3838, merged), cherry-picked from main to 2023, with release notes added.
Refs #4854 (closed)
Fixes #4874
Edited by Mark Abraham