Implement CUDA Graph functionality and perform associated refactoring
As it stands, GROMACS launches each CUDA activity independently. For small cases, the CPU overheads associated with launches are on the critical path, such that the GPU is "starved" of work.
CUDA Graphs aims to address this problem by allowing multiple activities to be launched as a single "graph", so that one CPU API call launches multiple GPU activities. There are also potential benefits on the GPU side: because CUDA has more awareness of the whole workflow, it can optimize execution and reduce GPU-side launch latencies.
Graphs support for fully async cases:
- Investigate issue where the "39b9e167 Use existing PME f ready event in PmeForceSenderGpu" change breaks the ability to capture graphs across multiple GPUs.
- Rebase prototype to latest GROMACS.
- Allow each graph to span multiple steps, avoiding unnecessary barriers across GPUs between steps. [We instead overlap the start and end of each step through separate graphs on odd and even steps, see below]
- Implement graph update functionality for the multi-GPU case to avoid expensive re-instantiation every NS/DD step. [Will be properly supported in a future CUDA version and will use the same codepath as single-GPU; we just need to trivially enable it when possible]
- Investigate node-level priorities.
- Clean up and refactor all code, bringing it up to a standard that can be upstreamed.
- Create unit test for the new MD Graphs class.
- Add some assertions on the expected values of these whenever the state of MdGpuGraph changes from "using" to "recording" to "recorded but not created yet" to "created" etc.
- Add simple test(s) that further validate the above point by checking valid/invalid order of calls.
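As a rough illustration of the state checks and call-order tests described above, the following is a minimal sketch, not the actual MdGpuGraph implementation. The names `GraphState`, `GraphLifecycle`, `startRecord`, `endRecord`, `createExecutable`, `launch` and `exampleValidSequence` are all hypothetical, the allowed transitions are an assumption based on the states listed above, and no CUDA calls are made.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical state names, taken from the phases listed in the items above.
enum class GraphState
{
    Invalid,   // nothing has been captured yet
    Recording, // capture in progress
    Recorded,  // recorded but not created yet
    Created,   // executable graph exists and can be launched
    Using      // graph has been launched at least once
};

// Hypothetical stand-in for the lifecycle bookkeeping only; it does not touch the GPU.
class GraphLifecycle
{
public:
    GraphState state() const { return state_; }

    // Assumption: recording may start from scratch or replace an existing graph.
    void startRecord()
    {
        require(state_ == GraphState::Invalid || state_ == GraphState::Created
                        || state_ == GraphState::Using,
                "startRecord called while a recording is pending");
        state_ = GraphState::Recording;
    }

    void endRecord()
    {
        require(state_ == GraphState::Recording, "endRecord called without startRecord");
        state_ = GraphState::Recorded;
    }

    void createExecutable()
    {
        require(state_ == GraphState::Recorded, "createExecutable called before endRecord");
        state_ = GraphState::Created;
    }

    void launch()
    {
        require(state_ == GraphState::Created || state_ == GraphState::Using,
                "launch called before createExecutable");
        state_ = GraphState::Using;
    }

private:
    // Throws on an out-of-order call so that tests can detect it.
    static void require(bool ok, const char* message)
    {
        if (!ok)
        {
            throw std::logic_error(message);
        }
    }

    GraphState state_ = GraphState::Invalid;
};

// Walks through one valid call order and asserts the state after each call.
inline void exampleValidSequence()
{
    GraphLifecycle graph;
    graph.startRecord();
    assert(graph.state() == GraphState::Recording);
    graph.endRecord();
    assert(graph.state() == GraphState::Recorded);
    graph.createExecutable();
    assert(graph.state() == GraphState::Created);
    graph.launch();
    assert(graph.state() == GraphState::Using);
}
```

A unit test could then check that an invalid order, such as calling `launch()` on a freshly constructed object, throws and leaves the state unchanged. Whether the real class should throw, assert, or return an error is a design choice this sketch does not settle.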
Extending to allow cases with CPU forces: Prototype at 0030776d
- Refactoring in do_force/do_md to separate out CPU force calculations and allow them to be called directly from do_md.
- Investigate and implement a mechanism using CUDA stream memory operations to interleave CUDA graph launch with CPU force calculations, allowing required dependencies to be met.
- Finalize code and upstream.
Edited by Alan Gray