Improves nbnxm kernel performance by up to 10% (F) and 6% (VF) on gfx90a, and by 5-6% on gfx908.
Adds a DPP update-based shuffle function to the SYCL kernel utils so that it can be reused elsewhere.
Refs #3847 #3934