SYCL: Use shuffle-based i-force reduction
This follows the CUDA implementation (for NVIDIA) and the OpenCL implementation (for Intel).
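For readers unfamiliar with the technique, here is a minimal host-side sketch of a shuffle-based i-force reduction. It is not the kernel code from this change. The function names (`shuffleDown`, `numReductionSteps`, `reduceForceI`), the lane layout (`lane = j * clusterSize + i`), and the power-of-two sizes are all assumptions made for illustration. On a device, the emulated shuffle would be a sub-group shuffle operation instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a sub-group "shuffle down": lane l reads the value
// held by lane l + delta. Lanes with no source keep their own value here;
// on a real device such results must simply not be used.
std::vector<float> shuffleDown(const std::vector<float>& lanes, std::size_t delta)
{
    std::vector<float> result(lanes);
    for (std::size_t lane = 0; lane + delta < lanes.size(); ++lane)
    {
        result[lane] = lanes[lane + delta];
    }
    return result;
}

// Number of halving steps needed: log2(subGroupSize / clusterSize).
// Assumes both sizes are powers of two and clusterSize <= subGroupSize.
int numReductionSteps(std::size_t subGroupSize, std::size_t clusterSize)
{
    assert(clusterSize > 0 && clusterSize <= subGroupSize);
    int steps = 0;
    for (std::size_t span = subGroupSize / clusterSize; span > 1; span /= 2)
    {
        ++steps;
    }
    return steps;
}

// Sums the per-lane partial forces over j for each i.
// Assumed (hypothetical) layout: lane = j * clusterSize + i.
// Returns the clusterSize totals that end up in lanes [0, clusterSize).
std::vector<float> reduceForceI(std::vector<float> lanes, std::size_t clusterSize)
{
    assert(clusterSize > 0 && clusterSize <= lanes.size());
    for (std::size_t delta = lanes.size() / 2; delta >= clusterSize; delta /= 2)
    {
        const std::vector<float> shifted = shuffleDown(lanes, delta);
        for (std::size_t lane = 0; lane < lanes.size(); ++lane)
        {
            lanes[lane] += shifted[lane]; // only the lower lanes stay meaningful
        }
    }
    return std::vector<float>(lanes.begin(), lanes.begin() + clusterSize);
}
```

Under these assumptions, calling `numReductionSteps` with (sub-group size, cluster size) pairs of (8, 4), (32, 8), and (64, 8) gives 1, 2, and 3 steps. That would be consistent with the "# reduction steps" column below, although this description does not state the sub-group sizes. After the shuffle steps, only the first `clusterSize` lanes of each sub-group would still need to write their totals to global memory.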
Speed-up of NBNXM kernels on the RNAse test system (RF and PME electrostatics), relative to 26005307. Values are new average time / old average time; lower is better:
Device | # reduction steps | RF F | RF F+V | PME F | PME F+V |
---|---|---|---|---|---|
Intel Xe Max | 1 | 1.07 | 7.49 | 1.05 | 4.28 |
NVIDIA RTX2080, hipSYCL | 2 | 0.36 | 0.44 | 0.41 | 0.72 |
NVIDIA RTX2080, oneAPI | 2 | 0.90 | 0.94 | 0.94 | 0.95 |
AMD Instinct MI50 | 3 | 0.98 | 1.01 | 0.99 | 0.99 |
Intel Xe Max: oneAPI 2022.1.0, OpenCL backend, compute-runtime 21.47.21710, intel-gpu02. The slow-down is likely related to multiple global atomics to the same location from the same work-group (!1741 (merged)). Thus, 1-step reduction is not used with cl_Size == 4.
NVIDIA RTX2080: hipSYCL 7102e5e + Clang 14.0, IntelLLVM 4a794df, CUDA 11.4, dev-gpu04. oneAPI kernels are ~10x slower than hipSYCL ones, so the oneAPI row is not the most relevant metric.
AMD Instinct MI50: hipSYCL 7102e5e, ROCm 4.5.2, dev-gpu06.
Refs #3847
Edited by Andrey Alekseenko