AdaptiveCpp runs fail with hip_queue: hipMemsetAsync() failed (error code = HIP:1)
## Summary

GROMACS built with AdaptiveCpp usually runs fine on Dardel, but with some inputs it fails mid-run with:

```
[hipSYCL Error] from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset() : hip_queue: hipMemsetAsync() failed (error code = HIP:1)
============== hipSYCL error report ==============
hipSYCL has caught the following undhandled asynchronous errors:

from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset(): hip_queue: hipMemsetAsync() failed (error code = HIP:1)
The application will now be terminated.
terminate called without an active exception
srun: error: nid002892: task 39: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=2789083.0
slurmstepd: error: *** STEP 2789083.0 ON nid002892 CANCELLED AT 2023-11-03T01:03:07 ***
srun: error: nid002892: tasks 0-38,40-63: Killed
srun: Force Terminated StepId=2789083.0
```

- https://gromacs.bioexcel.eu/t/hipsycl-error-when-running-with-multiple-mpis/7549
- @AndPdb92 observed similar errors with GROMACS 2024.1 + AdaptiveCpp 23.10, remediated by `HIPSYCL_RT_MAX_CACHED_NODES=5`.

## For developers: Why is this important?

GROMACS should not crash mid-run.

## Possible fixes

Here is what happens:

- On the step before an NS step, [force buffer clearing is launched](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L740).
- It is not submitted to the GPU immediately, but cached in ACpp's task graph.
- Then, on the NS step, [force and coordinate buffers are reallocated](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L658).
- Then, [force buffer clearing for the new buffer is launched](https://gitlab.com/gromacs/gromacs/blob/1fdbafbcc01e8ebb4a2b3ea4bae3aa9e7f8c3493/src/gromacs/nbnxm/nbnxm_gpu_data_mgmt.cpp#L694).
- Then the task graph is flushed:
  - the memset operating on the old pointer is submitted (but does not return an error immediately),
  - the memset operating on the new buffer returns the asynchronous error from the previous operation, `hipErrorInvalidValue`,
  - an exception is thrown, and GROMACS crashes.

With CUDA this does not happen since, IIRC, memory (de)allocation does a device-wide synchronization anyway. And even if synchronization is not forced, the buffer clearing is highly likely to succeed anyway, since it's quite fast. The same is true for a low `HIPSYCL_RT_MAX_CACHED_NODES` value and (theoretically) the instant-submission mode: if we don't delay flushing the task graph, things should work fine.

A solution would be to add local-stream synchronization before buffer reallocation. I do not think we even intend to explicitly rely on any synchronization before memory (de)allocations, so this would be good practice anyway.