SYCL nbnxm: use readfirstlane AMD builtin
Use the AMD readfirstlane builtin to force uniform loads of the exclusion index and interaction masks, thereby avoiding vector registers and vector operations. Some recent ROCm compilers, such as v5.3, optimize the exclusion index load automatically, but earlier ones do not, and the imask loads are not optimized even with more recent ROCm releases.
Observed performance improvements of up to 8% in the interaction kernels and 5-15% in the prune kernels on gfx90a; on older architectures such as gfx803, the prune kernels improve by up to 30%.
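
For illustration only, the pattern described above could look roughly like the sketch below. The helper names (broadcastFromFirstLane, loadInteractionMask) are hypothetical and not taken from the actual change; the only real API assumed is the clang builtin __builtin_amdgcn_readfirstlane, which is available when compiling for AMD GPU targets. On any other target the sketch falls back to returning the value unchanged, so it also compiles on the host.

```cpp
// Hypothetical helper (name is an assumption, not from the actual change).
// When compiling for an AMD GPU target, __builtin_amdgcn_readfirstlane returns
// the value held by the first active lane of the wavefront. Because the result
// is uniform across lanes, the compiler can keep it in a scalar register.
// On other targets the value is returned unchanged.
static inline int broadcastFromFirstLane(int value)
{
#if defined(__AMDGCN__) && defined(__has_builtin)
#    if __has_builtin(__builtin_amdgcn_readfirstlane)
    return __builtin_amdgcn_readfirstlane(value);
#    else
    return value;
#    endif
#else
    return value;
#endif
}

// Hypothetical stand-in for an interaction-mask load. The index is assumed to
// be identical for all lanes of the wavefront, so taking the first lane's
// value does not change the result; it only tells the compiler that the
// loaded mask is uniform.
static inline unsigned int loadInteractionMask(const unsigned int* imasks, int index)
{
    const int raw = static_cast<int>(imasks[index]);
    return static_cast<unsigned int>(broadcastFromFirstLane(raw));
}
```

The builtin is only valid here because the loaded value is already the same on every lane; applying it to a value that differs per lane would silently replace each lane's value with the first lane's.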