
SYCL: Reduce the number of atomic ops in NBNXM fShift calculation

Andrey Alekseenko requested to merge sycl-nbnxm-reduce-number-of-fp-atomics into master

On Intel GPUs, floating-point atomics are implemented as a compare-and-swap loop. Such a loop performs particularly poorly when several work-items in the same work-group update the same memory location.
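To illustrate why contention hurts, here is a minimal host-side C++ sketch of a floating-point atomic add built from a compare-and-swap loop. It is not the driver's or GROMACS's code: `atomicAddViaCas` is a hypothetical name, and `std::atomic<float>` only stands in for device memory.

```cpp
#include <atomic>

// Hypothetical helper (not from the MR): emulates an atomic float add
// with a compare-and-swap loop and returns the number of CAS attempts.
inline int atomicAddViaCas(std::atomic<float>& target, float value)
{
    int   attempts = 0;
    float expected = target.load(std::memory_order_relaxed);
    do
    {
        // On failure, compare_exchange_weak reloads the current value into
        // `expected`, so the sum is recomputed from fresh data on each retry.
        ++attempts;
    } while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed));
    return attempts;
}
```

Every failed exchange costs another round trip to memory, so the more work-items hit the same address at once, the more retries each of them is likely to need.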

This change reduces the average time of the NBNXM F+V kernel (Xe iGPU, ar-157k system) from 9.1 ms to 5.2 ms.

The code is, for now, specific to c_clSize==4, since we don't seem to need such an optimization for AMD and NVIDIA devices.
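The general idea of trading atomics for a local reduction can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the kernel code from this MR: `c_groupSize`, `casAdd`, `accumulateNaive` and `accumulateReduced` are hypothetical names, the group size of 4 merely mirrors c_clSize==4, and a serial loop stands in for whatever in-group reduction the device code uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical group size; chosen to mirror c_clSize == 4.
constexpr std::size_t c_groupSize = 4;

struct AccumulateResult
{
    float sum;       // final accumulated value
    int   atomicOps; // number of atomic updates issued
};

// Hypothetical CAS-based float add, standing in for a device atomic.
inline void casAdd(std::atomic<float>& target, float value)
{
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed))
    {
    }
}

// Naive scheme: every contribution issues its own atomic update.
inline AccumulateResult accumulateNaive(const std::vector<float>& contributions)
{
    std::atomic<float> target{ 0.0F };
    int                ops = 0;
    for (float c : contributions)
    {
        casAdd(target, c);
        ++ops;
    }
    return { target.load(), ops };
}

// Reduced scheme: sum each group locally, then issue one atomic per group.
inline AccumulateResult accumulateReduced(const std::vector<float>& contributions)
{
    assert(contributions.size() % c_groupSize == 0);
    std::atomic<float> target{ 0.0F };
    int                ops = 0;
    for (std::size_t start = 0; start < contributions.size(); start += c_groupSize)
    {
        float partial = 0.0F;
        for (std::size_t i = 0; i < c_groupSize; ++i)
        {
            partial += contributions[start + i];
        }
        casAdd(target, partial);
        ++ops;
    }
    return { target.load(), ops };
}
```

With eight contributions, the naive scheme issues eight atomic updates and the reduced scheme issues two, while both produce the same sum for exactly representable inputs.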

Refs #3847.
