SYCL: Use shuffle-based i-force reduction (!2571) · Merge requests · GROMACS / GROMACS

Andrey Alekseenko requested to merge aa-sycl-nbnxm-i-shuffles into hwe-release-2022 Mar 01, 2022

This follows the CUDA implementation (for NVIDIA) and the OpenCL implementation (for Intel).

Speed-up of NBNXM kernels on the RNAse test system (RF and PME electrostatics), relative to 26005307. Values are new average time / old average time; lower is better:

| Device | # reduction steps | RF F | RF F+V | PME F | PME F+V |
|---|---|---|---|---|---|
| Intel Xe Max | 1 | 1.07 | 7.49 | 1.05 | 4.28 |
| NVIDIA RTX2080, hipSYCL | 2 | 0.36 | 0.44 | 0.41 | 0.72 |
| NVIDIA RTX2080, oneAPI | 2 | 0.90 | 0.94 | 0.94 | 0.95 |
| AMD Instinct MI50 | 3 | 0.98 | 1.01 | 0.99 | 0.99 |
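For readers unfamiliar with the technique, here is a minimal host-side sketch of a shuffle-based reduction over the j-dimension. It is not the kernel code from this MR. It assumes a lane layout of `lane = j * clusterSize + i` and that the "# reduction steps" column equals log2(sub-group size / cluster size), which matches 1, 2 and 3 steps for sub-group sizes 8, 32 and 64 with cluster sizes 4, 8 and 8; that mapping is an assumption, not something stated here. The names `numReductionSteps`, `shuffleDown` and `reduceOverJ` are hypothetical, and the shuffle is emulated on a `std::vector` rather than issued on a device.

```cpp
#include <cstddef>
#include <vector>

// Number of halving steps needed to fold (subGroupSize / clusterSize)
// partial sums per i-atom into one. Assumes both sizes are powers of two.
int numReductionSteps(int subGroupSize, int clusterSize)
{
    int steps = 0;
    for (int n = subGroupSize / clusterSize; n > 1; n /= 2)
    {
        ++steps;
    }
    return steps;
}

// Host-side emulation of a shuffle-down: lane i reads the value of lane i + delta.
// Out-of-range reads return the lane's own value; those lanes are never used
// in the final result below, so the choice does not matter for this sketch.
std::vector<float> shuffleDown(const std::vector<float>& lanes, int delta)
{
    std::vector<float> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i)
    {
        const std::size_t src = i + static_cast<std::size_t>(delta);
        out[i] = (src < lanes.size()) ? lanes[src] : lanes[i];
    }
    return out;
}

// Reduce per-lane partial forces so that lanes [0, clusterSize) end up holding
// the sum over all j for their i-atom. Assumed layout: lane = j * clusterSize + i.
std::vector<float> reduceOverJ(std::vector<float> lanes, int clusterSize)
{
    const int subGroupSize = static_cast<int>(lanes.size());
    // One shuffle per step; the stride doubles each time: cl, 2*cl, 4*cl, ...
    for (int delta = clusterSize; delta < subGroupSize; delta *= 2)
    {
        const std::vector<float> shifted = shuffleDown(lanes, delta);
        for (int i = 0; i < subGroupSize; ++i)
        {
            lanes[i] += shifted[i];
        }
    }
    return std::vector<float>(lanes.begin(), lanes.begin() + clusterSize);
}
```

In a real kernel each lane would hold one force component in a register and the shuffle would be a sub-group operation; only the first `clusterSize` lanes would then write their totals to global memory.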

Intel Xe Max: oneAPI 2022.1.0, OpenCL backend, compute-runtime 21.47.21710, intel-gpu02. The slow-down is likely related to multiple global atomics to the same location from the same work-group (!1741). Therefore, the 1-step reduction is not used with cl_Size == 4.

NVIDIA RTX2080: hipSYCL 7102e5e + Clang 14.0, IntelLLVM 4a794df, CUDA 11.4, dev-gpu04. The oneAPI kernels are ~10x slower than the hipSYCL ones, so the oneAPI ratios are not the most relevant metric.

AMD Instinct MI50: hipSYCL 7102e5e, ROCm 4.5.2, dev-gpu06.

Refs #3847

Edited Jun 17, 2022 by Andrey Alekseenko
