SYCL: Use shuffle-based i-force reduction

Andrey Alekseenko requested to merge aa-sycl-nbnxm-i-shuffles into hwe-release-2022

Follows the CUDA implementation (for NVIDIA) and the OpenCL implementation (for Intel).
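For readers unfamiliar with the technique, the idea of a shuffle-based reduction can be emulated on the host in plain C++. This is a minimal sketch, not the code from this merge request: the names `shuffleDown` and `reduceIForce`, the lane layout (lane = j * clusterSize + i) and the sizes used are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a sub-group "shuffle down": lane l reads the value
// held by lane l + delta. Lanes that would read past the end keep their own
// value here (an assumption; their results are not used by the reduction).
std::vector<float> shuffleDown(const std::vector<float>& lanes, std::size_t delta)
{
    std::vector<float> out(lanes);
    for (std::size_t l = 0; l + delta < lanes.size(); ++l)
    {
        out[l] = lanes[l + delta];
    }
    return out;
}

// Hypothetical helper: for each i in [0, clusterSize), accumulate the values
// of lanes i, i + clusterSize, i + 2 * clusterSize, ... into lane i.
// lanes.size() plays the role of the sub-group size and is assumed to be
// clusterSize times a power of two. Returns the number of shuffle steps.
int reduceIForce(std::vector<float>& lanes, std::size_t clusterSize)
{
    assert(clusterSize > 0);
    int steps = 0;
    for (std::size_t delta = lanes.size() / 2; delta >= clusterSize; delta /= 2)
    {
        const std::vector<float> shifted = shuffleDown(lanes, delta);
        for (std::size_t l = 0; l < lanes.size(); ++l)
        {
            lanes[l] += shifted[l];
        }
        ++steps;
    }
    // Only lanes [0, clusterSize) hold complete sums at this point.
    return steps;
}
```

In this emulation, 32 lanes with a cluster size of 8 need two shuffle steps (offsets 16 and 8), 64 lanes need three, and 8 lanes with a cluster size of 4 need one. After the shuffles, only the first `clusterSize` lanes would have to write their sums to global memory, which is the point of reducing within the sub-group first.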

Change in NBNXM kernel run time on the RNAse test system (RF and PME electrostatics), relative to 26005307. Values are new average time / old average time; lower is better:

| | # reduction steps | RF F | RF F+V | PME F | PME F+V |
|---|---|---|---|---|---|
| Intel Xe Max | 1 | 1.07 | 7.49 | 1.05 | 4.28 |
| NVIDIA RTX2080, hipSYCL | 2 | 0.36 | 0.44 | 0.41 | 0.72 |
| NVIDIA RTX2080, oneAPI | 2 | 0.90 | 0.94 | 0.94 | 0.95 |
| AMD Instinct MI50 | 3 | 0.98 | 1.01 | 0.99 | 0.99 |

Intel Xe Max: oneAPI 2022.1.0, OpenCL backend, compute-runtime 21.47.21710, intel-gpu02. The slow-down is likely related to multiple global atomics to the same location from the same work-group (!1741 (merged)). Thus, the 1-step reduction is not used with cl_Size == 4.

NVIDIA RTX2080: hipSYCL 7102e5e + Clang 14.0, IntelLLVM 4a794df, CUDA 11.4, dev-gpu04. The oneAPI kernels are ~10x slower than the hipSYCL ones, so the oneAPI ratios are not the most relevant metric.

AMD Instinct MI50: hipSYCL 7102e5e, ROCm 4.5.2, dev-gpu06.

Refs #3847

Edited by Andrey Alekseenko