Draft: Allow CPU force clearing to overlap with GPU work

Szilárd Páll requested to merge sz_move_force_clearing_rebase_main into main

Force clearing can now overlap with the GPU wait, improving performance when CPU compute and transfers are on the critical path.

To avoid a race condition between the step N CPU force H2D copy and the step N+1 CPU force buffer clearing, a wait is needed. Previously, correctness was ensured by an implicit and undocumented dependency: force clearing was executed after the blocking wait for the coordinates from the GPU update to arrive (which also prevented overlap of force clearing).

This commit adds an explicit wait, but to avoid adding yet another cross-step event dependency it uses the already existing xUpdatedOnDevice event, which marks the start of the current step. The drawback of this approach is that it allows only partial overlap and none with update/constraints, even though the clearing could in principle start as soon as the force H2D copy is done.
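
The ordering described above can be illustrated with a small host-only sketch. This is not the code from this merge request: `DeviceEvent`, `runStepAndClear`, and the worker thread standing in for a GPU stream are all hypothetical, and only the name `xUpdatedOnDevice` is borrowed from the description. It assumes the event is marked after both the force H2D copy and the update have finished, so waiting on it before clearing is sufficient to avoid the race.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for a GPU event (not a GROMACS type).
struct DeviceEvent
{
    std::promise<void> signal;
    std::future<void>  done = signal.get_future();

    void markEvent() { signal.set_value(); }
    void waitForEvent()
    {
        assert(done.valid());
        done.wait();
    }
};

// Simulates step N on the "device" and the step N+1 clearing on the CPU.
// Returns the forces as they arrived on the "device".
std::vector<float> runStepAndClear(std::vector<float>& hostForces)
{
    std::vector<float> deviceForces(hostForces.size());
    DeviceEvent        xUpdatedOnDevice;

    // Worker thread playing the role of a GPU stream: force H2D copy,
    // then update/constraints (omitted here), then the event is marked.
    std::thread stream([&] {
        for (std::size_t i = 0; i < hostForces.size(); ++i)
        {
            deviceForces[i] = hostForces[i];
        }
        xUpdatedOnDevice.markEvent();
    });

    // Explicit wait: the host buffer must not be cleared while the copy
    // may still be reading it. Other CPU work could be placed before this
    // wait to overlap with the device work.
    xUpdatedOnDevice.waitForEvent();
    std::fill(hostForces.begin(), hostForces.end(), 0.0F);

    stream.join();
    return deviceForces;
}
```

Calling `runStepAndClear` on a buffer of nonzero forces should return those same values while leaving the host buffer zeroed. In this sketch, an event marked directly after the copy loop would let the clearing start earlier, but that corresponds to the extra cross-step dependency the description chooses to avoid.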

Also adds the missing cycle counting around the existing CPU-side sync on xUpdatedOnDevice (previously only used with graph scheduling).

Measured 10% performance improvement with a large CG system on 1 GPU.