GPU PME pipelining: Spread does not wait for grid clearing
- pme_gpu_clear_grids schedules the clearing of PME grids asynchronously in the "main" PME stream;
- It is called, via pme_gpu_reinit_computation, at the end of the step (PME-only caller; PP-PME caller);
- When pipelining is used, we launch the Spread kernels in dedicated streams and synchronize only on coordinate readiness; nothing checks that the grids have finished clearing.
- This is rarely (if ever) a problem for native CUDA but causes races with Open SYCL due to caching.
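
To make the missing dependency concrete, here is a minimal host-only sketch of the ordering problem described above. It is a toy model, not GROMACS code and not the CUDA or SYCL API: `Event`, `Op`, `Stream`, `run` and `scheduleStep` are all hypothetical names, and the scheduler deliberately favours the pipeline stream to mimic the worst case. It only illustrates that, without an event recorded after the clear and waited on before Spread, nothing orders the two operations.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy stand-in for a device event (hypothetical, not a CUDA/SYCL type).
struct Event { bool signaled = false; };

// One entry in an in-order stream: do work, record an event, or wait on one.
struct Op {
    enum class Kind { Work, Record, Wait } kind;
    std::string name;  // label, used by Work ops only
    Event*      event; // target, used by Record and Wait ops only
};

using Stream = std::deque<Op>;

// Drain all streams, always preferring the highest-index stream that can
// make progress. Each stream is in-order; streams are unordered relative to
// each other unless a Wait op blocks on an unsignaled event.
std::vector<std::string> run(std::vector<Stream> streams)
{
    std::vector<std::string> order;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int i = static_cast<int>(streams.size()) - 1; i >= 0; --i) {
            Stream& s = streams[i];
            if (s.empty()) { continue; }
            Op& op = s.front();
            if (op.kind != Op::Kind::Work) { assert(op.event != nullptr); }
            if (op.kind == Op::Kind::Wait && !op.event->signaled) { continue; } // blocked
            if (op.kind == Op::Kind::Work) { order.push_back(op.name); }
            if (op.kind == Op::Kind::Record) { op.event->signaled = true; }
            s.pop_front();
            progressed = true;
            break; // rescan, starting again from the preferred stream
        }
    }
    return order;
}

// Build one step: clearing goes to the "main" stream, Spread to a dedicated
// pipeline stream. The flag toggles the cross-stream dependency.
std::vector<std::string> scheduleStep(bool spreadWaitsForClear)
{
    Event  gridsCleared;
    Stream mainStream;
    Stream pipelineStream;
    mainStream.push_back({Op::Kind::Work, "clear_grids", nullptr});
    mainStream.push_back({Op::Kind::Record, "", &gridsCleared});
    if (spreadWaitsForClear) {
        pipelineStream.push_back({Op::Kind::Wait, "", &gridsCleared});
    }
    pipelineStream.push_back({Op::Kind::Work, "spread", nullptr});
    return run({mainStream, pipelineStream});
}
```

In this model `scheduleStep(false)` runs `spread` before `clear_grids`, while `scheduleStep(true)` runs `clear_grids` first. On a real backend the equivalent would presumably be an event recorded in the main PME stream after the clear and waited on by each pipeline stream before Spread; how that maps onto the existing GROMACS event wrappers is not shown here.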
Exact steps to reproduce
See #4733 (closed).
For developers: Why is this important?
Even on CUDA where it does not seem to happen often, we should make sure the dependencies between streams are correctly set up.