SYCL: Enable native atomics for DPC++/CUDA
Refs https://github.com/intel/llvm/issues/5936
Tested on a V100 with a 384k water box and a mid-January IntelLLVM build. The shuffle-based reduction (!2571) is included.
Kernel runtime compared to CUDA-Clang (lower is better):
| | NB F PME | NB FV PME | NB F RF | NB FV RF |
|---|---|---|---|---|
| Before | +57% | +1300% | +131% | +1900% |
| After | +23% | +44% | +90% | +137% |
On smaller systems, the difference is less dramatic.
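
To make the table easier to read as a before/after speedup of the SYCL kernels themselves, the overheads can be converted to runtime ratios. This is only a rough sketch: it assumes both rows are measured against the same CUDA-Clang baseline (taken as 1.0), and the `speedup` helper is illustrative, not part of this change.

```python
# Overheads relative to CUDA-Clang, copied from the table above (percent).
before = {"NB F PME": 57, "NB FV PME": 1300, "NB F RF": 131, "NB FV RF": 1900}
after = {"NB F PME": 23, "NB FV PME": 44, "NB F RF": 90, "NB FV RF": 137}


def speedup(before_pct, after_pct):
    """Old SYCL runtime divided by new SYCL runtime (CUDA-Clang runtime = 1.0)."""
    return (1 + before_pct / 100) / (1 + after_pct / 100)


# Assumption: the CUDA-Clang baseline is identical for both rows.
speedups = {name: speedup(before[name], after[name]) for name in before}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster than before")
```

Under that assumption, the FV kernels gain the most (roughly 8x to 10x), while the F-only kernels improve by roughly 1.2x to 1.3x.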