GPU PME Spread pipelining broken in SYCL

Summary

In the SYCL backend, the atom offsets are not applied during charge spreading. This leads to incorrect results when PME pipelining is used, that is, when all of the following hold:

PME is offloaded to GPU, and
direct GPU communication is active, and
several PP ranks are used.
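
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the GROMACS kernel: the names `spreadStage` and `spreadPipelined` are hypothetical, spreading is reduced to a 1D nearest-grid-point assignment, and it assumes each pipeline stage handles one contiguous range of atoms (for example, the atoms received from one PP rank). The `applyOffset` flag is only there to show both behaviours side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D nearest-grid-point spreading, for illustration only.
// Intended to spread atoms [atomOffset, atomOffset + numAtoms) onto the grid.
void spreadStage(const std::vector<int>&   gridIndex, // grid cell of each atom
                 const std::vector<float>& charge,    // charge of each atom
                 std::size_t               atomOffset,
                 std::size_t               numAtoms,
                 bool                      applyOffset,
                 std::vector<float>&       grid)
{
    assert(atomOffset + numAtoms <= charge.size());
    for (std::size_t i = 0; i < numAtoms; ++i)
    {
        // Bug pattern: if the offset is dropped, every stage reads atoms
        // [0, numAtoms) instead of its own range.
        const std::size_t atom = applyOffset ? atomOffset + i : i;
        grid[gridIndex[atom]] += charge[atom];
    }
}

// Runs one spreading stage per contiguous atom range (assumed here to be
// one stage per sending PP rank) and accumulates into a single grid.
std::vector<float> spreadPipelined(const std::vector<int>&         gridIndex,
                                   const std::vector<float>&       charge,
                                   const std::vector<std::size_t>& atomsPerStage,
                                   std::size_t                     gridSize,
                                   bool                            applyOffset)
{
    std::vector<float> grid(gridSize, 0.0F);
    std::size_t        offset = 0;
    for (std::size_t n : atomsPerStage)
    {
        spreadStage(gridIndex, charge, offset, n, applyOffset, grid);
        offset += n;
    }
    return grid;
}
```

With a single stage the offset is zero, so both variants agree. With two or more stages, the variant without the offset spreads the first range repeatedly and never touches the later atoms, which fits the problem showing up only when several PP ranks are used.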

Exact steps to reproduce

On gpu12:

$ module load cuda/11.7.1 cmake/3.24.2 ninja/1.10.0 openmpi/1.8.8-cuda6.5 clang/15.0.0 boost/1.75.0 /nethome/aland/modules/modulefiles/hipSYCL/0.9.4-cuda11.7.1 

$ cmake ../.. -DCMAKE_CXX_COMPILER=clang++-15 -DCMAKE_C_COMPILER=clang-15  -DCMAKE_BUILD_TYPE=Release -DGMX_GPU=SYCL -DGMX_SYCL_HIPSYCL=ON -DHIPSYCL_TARGETS='cuda:sm_61,sm_70' -DGMX_MPI=ON

$ GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 mpirun -np 3 gmx_mpi mdrun -nb gpu -pme gpu -update gpu -npme 1  -nsteps 1000  -ntomp 8 -pin on

The energy drift in the output file is around 1e-1 kJ/mol/ps for a 384k water box, about three orders of magnitude above the normal value of ~1e-4 kJ/mol/ps.

For developers: Why is this important?

Violating physics is not great.