GPU PME pipelining: Spread does not wait for grid clearing
Summary
- pme_gpu_clear_grids schedules the clearing of PME grids asynchronously in the "main" PME stream;
- It is called, via pme_gpu_reinit_computation, at the end of the step (PME-only caller; PP-PME caller);
- When pipelining is used, we launch Spread kernels in the dedicated streams, only synchronizing with coordinate readiness, but don't check that grids have completed clearing.
- This is rarely (if ever) a problem for native CUDA but causes races with Open SYCL due to caching.
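
For illustration, the missing dependency described above can be expressed with a CUDA event: record it in the "main" stream right after the clear, and make the dedicated stream wait on it before Spread. This is a minimal standalone sketch, not GROMACS code; spreadSketch, mainStream, pipelineStream and gridsCleared are hypothetical names, and a single float array stands in for the PME grids.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do                                                                   \
    {                                                                    \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess)                                         \
        {                                                                \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical stand-in for a Spread kernel: it accumulates into the grid,
// so the result is only correct if the grid was cleared beforehand.
__global__ void spreadSketch(float* grid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        grid[i] += 1.0f;
    }
}

int main()
{
    const int n      = 1 << 20;
    float*    d_grid = nullptr;
    CHECK(cudaMalloc(&d_grid, n * sizeof(float)));

    // Assumption: one "main" PME stream and one dedicated pipeline stream.
    cudaStream_t mainStream, pipelineStream;
    CHECK(cudaStreamCreate(&mainStream));
    CHECK(cudaStreamCreate(&pipelineStream));

    // Event used only for ordering, so timing is disabled.
    cudaEvent_t gridsCleared;
    CHECK(cudaEventCreateWithFlags(&gridsCleared, cudaEventDisableTiming));

    // End of step: schedule the clear asynchronously in the main stream
    // and mark the point at which it completes.
    CHECK(cudaMemsetAsync(d_grid, 0, n * sizeof(float), mainStream));
    CHECK(cudaEventRecord(gridsCleared, mainStream));

    // Next step: the dedicated stream must not start Spread before the
    // clear has finished. Without this wait, the two streams are unordered.
    CHECK(cudaStreamWaitEvent(pipelineStream, gridsCleared, 0));
    spreadSketch<<<(n + 255) / 256, 256, 0, pipelineStream>>>(d_grid, n);
    CHECK(cudaGetLastError());

    // Copy one cell back to inspect the result on the host.
    CHECK(cudaStreamSynchronize(pipelineStream));
    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_grid, sizeof(float), cudaMemcpyDeviceToHost));
    printf("grid[0] after clear + spread: %f\n", first);

    CHECK(cudaEventDestroy(gridsCleared));
    CHECK(cudaStreamDestroy(pipelineStream));
    CHECK(cudaStreamDestroy(mainStream));
    CHECK(cudaFree(d_grid));
    return 0;
}
```

The wait is enqueued on the device, so the host thread is not blocked; only the Spread work in the dedicated stream is held back until the clear completes. How such an event would be threaded through the actual PME GPU data structures is left open here.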
Exact steps to reproduce
See #4733 (closed).
For developers: Why is this important?
Even on CUDA, where the race does not seem to occur often, we should make sure the dependencies between streams are set up correctly.