SYCL: Reduce the number of atomic ops in NBNXM fShift calculation (!1741)

Andrey Alekseenko requested to merge sycl-nbnxm-reduce-number-of-fp-atomics into master Jun 14, 2021

On Intel GPUs, floating-point atomics are implemented as a compare-and-swap loop. This loop performs particularly poorly when the same memory location is updated repeatedly from within the same work-group.
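
For readers unfamiliar with the pattern, here is a minimal host-side sketch of a floating-point add built from a compare-and-swap loop. It is plain C++ with `std::atomic<float>`, not the Intel driver's or GROMACS's actual code, and the helper name `atomicAddViaCas` is hypothetical. It only illustrates why contention hurts: every failed exchange forces another trip through the loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not a GROMACS or SYCL API): adds `value` to `target`
// using a compare-and-swap loop and returns the number of CAS attempts made.
inline int atomicAddViaCas(std::atomic<float>& target, float value)
{
    int   attempts = 0;
    float expected = target.load(std::memory_order_relaxed);
    do
    {
        // If another thread changed `target` in the meantime, the exchange
        // fails, `expected` is refreshed with the current value, and we retry.
        ++attempts;
    } while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed));
    return attempts;
}
```

Without contention the loop typically finishes after a single attempt. When many threads target the same address, each successful update can invalidate the `expected` value held by the others, so the number of attempts grows.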

This change reduces the average run time of the NBNXM F+V kernel (Xe iGPU, ar-157k system) from 9.1 ms to 5.2 ms.

The code is, for now, specific to c_clSize==4, since we don't seem to need such an optimization for AMD and NVIDIA devices.
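
The general idea of trading many atomic updates for a few can be sketched on the host as follows. This is a simplified illustration in plain C++, not the kernel code from this merge request. The function names, the `Result` struct, and the choice to pre-sum contributions in groups of `c_clSize` are assumptions made for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t c_clSize = 4; // cluster size named in the description

struct Result
{
    float total;        // accumulated value
    int   numAtomicOps; // how many atomic updates were issued
};

// Hypothetical helper: CAS-loop float add, standing in for a device atomic.
inline void atomicAdd(std::atomic<float>& target, float value)
{
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed)) {}
}

// Baseline: every contribution issues its own atomic update.
inline Result accumulateNaive(const std::vector<float>& contributions)
{
    std::atomic<float> target{ 0.0F };
    int                ops = 0;
    for (float c : contributions)
    {
        atomicAdd(target, c);
        ++ops;
    }
    return { target.load(), ops };
}

// Assumed scheme: sum each group of c_clSize contributions in a private
// variable first, then issue a single atomic update per group.
inline Result accumulatePreReduced(const std::vector<float>& contributions)
{
    std::atomic<float> target{ 0.0F };
    int                ops = 0;
    for (std::size_t start = 0; start < contributions.size(); start += c_clSize)
    {
        const std::size_t end     = std::min(start + c_clSize, contributions.size());
        float             partial = 0.0F;
        for (std::size_t i = start; i < end; ++i)
        {
            partial += contributions[i];
        }
        atomicAdd(target, partial);
        ++ops;
    }
    return { target.load(), ops };
}
```

For 16 contributions, the baseline issues 16 atomic updates while the pre-reduced variant issues 4, and both produce the same sum as long as the additions are exact. In a real kernel the partial sums would be combined across work-items (for example through local memory or sub-group operations) rather than in a serial loop.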

Refs #3847.
