Implement generic j-reduction in nbnxm SYCL kernels
This commit implements the generic j-reduction in the SYCL kernels; the implementation is identical to the one in the OpenCL kernels.
Also added a subGroupBarrier() helper, which is needed for correctness on the CUDA backend when targeting NVIDIA architectures.
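
For readers unfamiliar with the pattern, the following is a rough host-side C++ model of what a generic (non-shuffle) j-reduction can look like. It is only a sketch under stated assumptions, not the kernel code from this commit: the cluster size of 4, the staging-buffer layout, and the names reduceForceJGeneric, c_clusterSize and c_numLanes are all chosen here for illustration. Each (i, j) lane first stores its partial force into a staging buffer, and then the lanes with i-index 0, 1 and 2 each sum one force component over their j-row. On a device, a barrier such as the subGroupBarrier() helper would presumably be needed between the store and the sum, so that no lane reads the buffer before all lanes have written to it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants for illustration only (not taken from the commit).
constexpr int c_clusterSize = 4;                             // assumed atoms per cluster
constexpr int c_numLanes    = c_clusterSize * c_clusterSize; // one lane per (i, j) pair

using Force = std::array<float, 3>; // x, y, z components

// Host-side model of a staged j-reduction.
// Lane index convention assumed here: tidx = tidxi + tidxj * c_clusterSize.
// Returns, for each j-atom, the force summed over all i-lanes of its row.
std::array<Force, c_clusterSize> reduceForceJGeneric(const std::vector<Force>& laneForce)
{
    assert(laneForce.size() == static_cast<std::size_t>(c_numLanes));

    // Step 1: every lane writes its three components into a staging buffer
    // (this would be local/shared memory on a device), one stride per component.
    std::array<float, 3 * c_numLanes> buf{};
    for (int tidx = 0; tidx < c_numLanes; tidx++)
    {
        for (int c = 0; c < 3; c++)
        {
            buf[c * c_numLanes + tidx] = laneForce[tidx][c];
        }
    }

    // On a device, a barrier would go here so that all writes are visible
    // before any lane starts reading. The sequential loops make it implicit.

    // Step 2: within each j-row, lanes with tidxi < 3 each reduce one component.
    std::array<Force, c_clusterSize> out{};
    for (int tidxj = 0; tidxj < c_clusterSize; tidxj++)
    {
        for (int tidxi = 0; tidxi < 3; tidxi++)
        {
            float sum = 0.0F;
            for (int j = tidxj * c_clusterSize; j < (tidxj + 1) * c_clusterSize; j++)
            {
                sum += buf[tidxi * c_numLanes + j];
            }
            // A real kernel would accumulate this into global memory instead.
            out[tidxj][tidxi] = sum;
        }
    }
    return out;
}
```

In this model, calling reduceForceJGeneric with 16 per-lane forces yields four per-j-atom totals. The staged approach trades extra memory traffic for not depending on sub-group shuffle support, which is one reason a generic fallback of this kind can be useful.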
Refs #3934
Edited by Szilárd Páll