AdaptiveCpp runs fail with hip_queue: hipMemsetAsync() failed (error code = HIP:1)
## Summary
GROMACS built with AdaptiveCpp usually runs fine on Dardel, but with some inputs it fails mid-run with:
```
[hipSYCL Error] from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset() : hip_queue: hipMemsetAsync() failed (error code = HIP:1)
============== hipSYCL error report ==============
hipSYCL has caught the following undhandled asynchronous errors:
from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset(): hip_queue: hipMemsetAsync() failed (error code = HIP:1)
The application will now be terminated.
terminate called without an active exception
srun: error: nid002892: task 39: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=2789083.0
slurmstepd: error: *** STEP 2789083.0 ON nid002892 CANCELLED AT 2023-11-03T01:03:07 ***
srun: error: nid002892: tasks 0-38,40-63: Killed
srun: Force Terminated StepId=2789083.0
```
- https://gromacs.bioexcel.eu/t/hipsycl-error-when-running-with-multiple-mpis/7549
- @AndPdb92 observed similar errors with GROMACS 2024.1 + AdaptiveCpp 23.10, worked around by setting `HIPSYCL_RT_MAX_CACHED_NODES=5`.
## For developers: Why is this important?
GROMACS should not crash mid-run.
## Possible fixes
Here is what happens:
- On the step before an NS step, [force buffer clearing is launched](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L740).
- It is not submitted to GPU immediately, but cached in ACpp's task graph.
- Then, on the NS step, [force and coordinate buffers are reallocated](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L658).
- Then, [force buffer clearing for the new buffer is launched](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L694).
- Then the task graph is flushed:
  - the memset operating on the old pointer is submitted (but does not return an error immediately),
  - the memset operating on the new buffer returns the asynchronous error from the previous operation, `hipErrorInvalidValue`,
  - an exception is thrown and GROMACS crashes.
With CUDA this does not happen since, IIRC, memory (de)allocation performs device-wide synchronization anyway. And even if synchronization is not forced, the buffer clearing is highly likely to complete before the buffer is freed, since it is quite fast.
The same holds for a low `HIPSYCL_RT_MAX_CACHED_NODES` value and (theoretically) for the instant-submission mode: if we do not delay flushing the task graph, things should work fine.
A solution would be to add local-stream synchronization before buffer reallocation. I do not think we intend to rely on any implicit synchronization around memory (de)allocations, so making it explicit would be good practice anyway.