Resumable workloads for DAP flows
## Problem Statement

For the MVP version of [Approval Human In The Loop Node (HITL)](https://gitlab.com/groups/gitlab-org/-/work_items/20652), DAP flows handle resume actions by creating a brand new workload each time. This approach has significant drawbacks:

- **Dependency reinstall overhead**: Every resume reinstalls DAP dependencies and sets up the git and SRT environment again from scratch, adding minutes of cold-start delay.
- **Loss of WIP changes**: Any in-progress file changes made by the agent in the current environment are discarded. The agent must redo that work based on workflow checkpoints alone.
- **Resource inefficiency**: Keeping a workload alive while waiting for human input wastes compute; tearing it down and recreating it wastes time.

## Proposal

Leverage [Resumable Jobs for CI and Agent Sessions](https://gitlab.com/groups/gitlab-org/-/work_items/21159) to introduce true suspend/resume semantics for DAP workloads.

When a DAP flow reaches a `human_request` node, the underlying Runner job is **suspended** rather than terminated. The full environment state (git repo, file changes, installed dependencies, agent session context) is snapshotted and persisted. When `human_input` is received, the same workload **resumes** from exactly where it left off, with near-instant startup and zero rework.

This turns the current create-new-workload-on-every-resume pattern into a true pause/resume lifecycle:

```
human_request node hit → Suspend workload → Persist environment snapshot
human_input received → Resume workload → Restore environment, continue execution
```

## Infrastructure Dependency

Resumable Jobs require a modern Runner executor that supports suspend/resume semantics. The current fleet serving the `gitlab--duo` CI tag uses the legacy **docker+machine** executor, which does not support this. A dedicated fleet using a modern executor must be provisioned to unblock DAP teams.

Supported modern executors:

- **Docker Autoscaler** (via Fleeting + Taskscaler), targeted for %19.0
- **Instance Autoscaler** (via Fleeting + Taskscaler)
- **Kubernetes** (planned, with gVisor isolation)

This is tracked in [Modern executor fleet on .com for Resumable Jobs and AI workloads](https://gitlab.com/gitlab-org/gitlab/-/issues/597039) and [Provision a runner fleet with a modern executor for the `duo` tag](https://gitlab.com/gitlab-org/gitlab/-/issues/597038).

## Key Constraints

- A resumed job must return to **exactly the runner that suspended it**, since only that runner's taskscaler holds the suspended acquisition. This routing enforcement is tracked in [Ensure resumed CI jobs return to the correct runner](https://gitlab.com/gitlab-org/gitlab/-/issues/596841).
- The DAP team is shipping an initial HITL MVC in %18.11 using a [stopgap that does not require runner-side suspend/resume](https://gitlab.com/groups/gitlab-org/-/epics/20652#note_3150140098). The full environment-preserving experience depends on Resumable Jobs being available by %19.0.

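
The suspend/resume lifecycle and the same-runner constraint can be sketched as a small state machine. This is an illustrative model only, not GitLab or Runner code: `Workload`, `State`, `suspend`, `resume`, `runner_id`, and `snapshot_ref` are hypothetical names, and the real enforcement is expected to live in the job routing layer tracked in the issue above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


@dataclass
class Workload:
    """Hypothetical model of a DAP workload; not a GitLab API."""
    workload_id: str
    runner_id: str                      # runner that executes (and suspends) the job
    state: State = State.RUNNING
    snapshot_ref: Optional[str] = None  # opaque handle to the persisted environment


def suspend(workload: Workload, snapshot_ref: str) -> None:
    """Model of the transition taken when the flow hits a human_request node."""
    if workload.state is not State.RUNNING:
        raise ValueError("only a running workload can be suspended")
    workload.snapshot_ref = snapshot_ref
    workload.state = State.SUSPENDED


def resume(workload: Workload, candidate_runner_id: str) -> None:
    """Model of the transition taken when human_input is received."""
    if workload.state is not State.SUSPENDED:
        raise ValueError("only a suspended workload can be resumed")
    if candidate_runner_id != workload.runner_id:
        # Assumption from the constraint above: only the suspending runner's
        # taskscaler holds the suspended acquisition, so other runners are rejected.
        raise PermissionError("resume must be routed to the suspending runner")
    workload.state = State.RUNNING
```

In this sketch a resume attempt from any other runner fails and leaves the workload suspended, so it can still be picked up later by the runner that suspended it.
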
## Benefits Over Current Approach

| | Current (new workload per resume) | With Resumable Workloads |
|--|-----------------------------------|--------------------------|
| Resume startup time | 2-5 min (reinstall deps, re-clone) | \< 10s (volume reattach) |
| WIP file changes | Lost on suspend | Preserved across pause |
| Installed dependencies | Reinstalled every time | Persisted in snapshot |
| Git environment | Re-cloned every time | Preserved in snapshot |
| Compute during pause | Idle (wasted) or torn down | Released; no cost during pause |

## Success Metrics

| Metric | Target |
|--------|--------|
| Workload resume time | \< 10s from `human_input` received to agent continuing execution |
| Environment continuity | Git state, file changes, and installed dependencies fully preserved across pause/resume; 0 agent rework required |
| Compute efficiency | No compute consumed during the pause window between `human_request` and `human_input` |
| HITL flow shipped | 1 end-to-end DAP flow with HITL using suspend/resume delivered by %"19.10" |

## Related Epics and Issues

- [Resumable Jobs for CI and Agent Sessions](https://gitlab.com/groups/gitlab-org/-/work_items/21159): the underlying Runner capability this epic depends on
- [Approval Human In The Loop Node (HITL)](https://gitlab.com/groups/gitlab-org/-/work_items/20652): the DAP flow feature this epic unblocks
- [Remote Access into Jobs for Agents & Humans](https://gitlab.com/groups/gitlab-org/-/work_items/21160): complementary capability enabling SSH access into live Runner environments
- [Blueprint - Resumable Jobs for CI and Agent Sessions](https://gitlab.com/gitlab-org/gitlab/-/issues/593314): architecture blueprint
- [Ensure resumed CI jobs return to the correct runner](https://gitlab.com/gitlab-org/gitlab/-/issues/596841): runner routing enforcement
- [Modern executor fleet on .com for Resumable Jobs and AI workloads](https://gitlab.com/gitlab-org/gitlab/-/issues/597039): .com fleet provisioning
- [Provision a runner fleet with a modern executor for the `duo` tag](https://gitlab.com/gitlab-org/gitlab/-/issues/597038): dedicated `duo` tag fleet