Draft: SYCL GPU Pairlist sorting
Improve NBNXM kernel performance by making sure the work is more evenly distributed.
The approach closely follows the CUDA / HIP implementation.
A sycl::popcount implementation is added for AdaptiveCpp 23.10 and earlier.
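For reference, a minimal sketch of how a portable popcount can be written in plain C++17. The name popcountFallback is hypothetical, and the bit-summing technique shown here is an assumption for illustration, not necessarily what this change uses:

```cpp
#include <cstdint>

// Hypothetical helper name, for illustration only.
// Counts the set bits of a 32-bit mask with the classic parallel bit-summing trick.
constexpr std::uint32_t popcountFallback(std::uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 // sums of adjacent bit pairs
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // sums per 4-bit group
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 // sums per byte
    return (x * 0x01010101u) >> 24;                   // add the four byte sums
}

// Compile-time sanity checks.
static_assert(popcountFallback(0u) == 0);
static_assert(popcountFallback(0xFFFFFFFFu) == 32);
```

Being constexpr and branch-free, a helper like this can be called from host or device code alike.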
The prefix sum is inspired by oneDPL, which uses
sycl::joint_exclusive_scan over a single work-group for small inputs.
Our current histogram size is small enough, so we do the same instead
of pulling in the whole library.
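To illustrate how a histogram and an exclusive prefix sum combine into a sort, here is a host-side sketch in plain C++17. It is not the device code from this change: std::exclusive_scan stands in for the single-work-group sycl::joint_exclusive_scan, the name sortByWorkDescending is hypothetical, and ordering entries from most to least work is an assumption made for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side sketch of a counting sort.
// Returns entry indices ordered from most to least work (stable within a bin).
// Every value in workPerEntry must lie in [0, numBins).
std::vector<int> sortByWorkDescending(const std::vector<int>& workPerEntry, int numBins)
{
    // 1. Histogram: bin 0 collects the heaviest entries (work == numBins - 1).
    std::vector<int> histogram(numBins, 0);
    for (int work : workPerEntry)
    {
        assert(work >= 0 && work < numBins);
        ++histogram[numBins - 1 - work];
    }

    // 2. Exclusive prefix sum turns bin counts into starting offsets.
    //    On the device this step would be the joint exclusive scan.
    std::vector<int> offsets(numBins);
    std::exclusive_scan(histogram.begin(), histogram.end(), offsets.begin(), 0);

    // 3. Scatter: each entry claims the next free slot of its bin.
    std::vector<int> sortedIndices(workPerEntry.size());
    for (std::size_t i = 0; i < workPerEntry.size(); ++i)
    {
        const int bin = numBins - 1 - workPerEntry[i];
        sortedIndices[offsets[bin]++] = static_cast<int>(i);
    }
    return sortedIndices;
}
```

Because the scan only runs over the bins, not over the entries, its input stays small regardless of pairlist length, which is why a single work-group scan can suffice.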
Still a draft; needs more correctness / performance checks.
Fixes #4979