# Resumable workloads for DAP flows
## Problem Statement
For the MVP version of [Approval Human In The Loop Node (HITL)](https://gitlab.com/groups/gitlab-org/-/work_items/20652), DAP flows handle resume actions by creating a brand new workload each time. This approach has significant drawbacks:
- **Dependency reinstall overhead**: Every resume reinstalls DAP dependencies and rebuilds the Git and SRT environment from scratch, adding an estimated 2-5 minutes of cold-start delay.
- **Loss of WIP changes**: Any in-progress file changes made by the agent in the current environment are discarded. The agent must redo that work based on workflow checkpoints alone.
- **Resource inefficiency**: Keeping a workload alive while waiting for human input wastes compute; tearing it down and recreating it wastes time.
## Proposal
Leverage [Resumable Jobs for CI and Agent Sessions](https://gitlab.com/groups/gitlab-org/-/work_items/21159) to introduce true suspend/resume semantics for DAP workloads.
When a DAP flow reaches a `human_request` node, the underlying Runner job is **suspended** rather than terminated. The full environment state — git repo, file changes, installed dependencies, agent session context — is snapshotted and persisted. When `human_input` is received, the same workload **resumes** from exactly where it left off, with near-instant startup and zero rework.
This turns the current create-new-workload-on-every-resume pattern into a true pause/resume lifecycle:
```
human_request node hit → Suspend workload → Persist environment snapshot
human_input received → Resume workload → Restore environment, continue execution
```
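As a rough illustration of the lifecycle above, the following Python sketch models a workload as a two-state machine. Every name in it (`Workload`, `WorkloadState`, `on_human_request`, `on_human_input`) is hypothetical and not part of the Resumable Jobs API. The `snapshot` dict is an in-memory stand-in for the persisted environment snapshot.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class WorkloadState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


@dataclass
class Workload:
    """Hypothetical model of a DAP workload; not a real GitLab API."""

    workload_id: str
    state: WorkloadState = WorkloadState.RUNNING
    # Stand-in for the live environment (git state, WIP files, dependencies).
    environment: dict = field(default_factory=dict)
    # Stand-in for the persisted snapshot; None while the workload is running.
    snapshot: Optional[dict] = None

    def on_human_request(self) -> None:
        """human_request node hit: suspend and persist the environment."""
        if self.state is not WorkloadState.RUNNING:
            raise RuntimeError("only a running workload can be suspended")
        self.snapshot = dict(self.environment)
        self.environment = {}  # models compute being released during the pause
        self.state = WorkloadState.SUSPENDED

    def on_human_input(self) -> None:
        """human_input received: restore the snapshot and continue."""
        if self.state is not WorkloadState.SUSPENDED or self.snapshot is None:
            raise RuntimeError("only a suspended workload can be resumed")
        self.environment = self.snapshot
        self.snapshot = None
        self.state = WorkloadState.RUNNING
```

The point of the sketch is that the workload keeps its identity and its environment across the pause, instead of a new workload being created for every resume.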
## Infrastructure Dependency
Resumable Jobs require a modern Runner executor that supports suspend/resume semantics. The current fleet serving the `gitlab--duo` CI tag uses the legacy **docker+machine** executor, which does not support this. A dedicated fleet using a modern executor must be provisioned to unblock DAP teams.
Supported modern executors:
- **Docker Autoscaler** (via Fleeting + Taskscaler) — targeted for %19.0
- **Instance Autoscaler** (via Fleeting + Taskscaler)
- **Kubernetes** (planned, with gVisor isolation)
This is tracked in [Modern executor fleet on .com for Resumable Jobs and AI workloads](https://gitlab.com/gitlab-org/gitlab/-/issues/597039) and [Provision a runner fleet with a modern executor for the `duo` tag](https://gitlab.com/gitlab-org/gitlab/-/issues/597038).
## Key Constraints
- A resumed job must return to **exactly the runner that suspended it**, since only that runner's taskscaler holds the suspended acquisition. This routing enforcement is tracked in [Ensure resumed CI jobs return to the correct runner](https://gitlab.com/gitlab-org/gitlab/-/issues/596841).
- The DAP team is shipping an initial HITL MVC in %18.11 using a [stopgap that does not require runner-side suspend/resume](https://gitlab.com/groups/gitlab-org/-/epics/20652#note_3150140098). The full environment-preserving experience depends on Resumable Jobs being available by %19.0.
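The routing constraint in the first bullet can be sketched as a simple affinity check. This is a minimal illustration only: the `Job` record, its `suspended_by_runner_id` field, and `runner_may_pick_up` are hypothetical names, and the actual enforcement is whatever the linked routing issue settles on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are illustrative only."""

    job_id: int
    # Runner that suspended the job, or None if it was never suspended.
    suspended_by_runner_id: Optional[int] = None


def runner_may_pick_up(job: Job, runner_id: int) -> bool:
    """Return True if the given runner is allowed to receive the job."""
    if job.suspended_by_runner_id is None:
        # A fresh job has no affinity, so any matching runner may take it.
        return True
    # A resumed job must return to the runner that holds its suspended acquisition.
    return job.suspended_by_runner_id == runner_id
```

For example, a job suspended by runner 7 would be offered only to runner 7 on resume, while a job that was never suspended stays eligible for any runner.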
## Benefits Over Current Approach
| | Current (new workload per resume) | With Resumable Workloads |
|--|-----------------------------------|--------------------------|
| Resume startup time | 2-5 min (reinstall deps, re-clone) | \< 10s (volume reattach) |
| WIP file changes | Lost when the workload is torn down | Preserved across pause |
| Installed dependencies | Reinstalled every time | Persisted in snapshot |
| Git environment | Re-cloned every time | Preserved in snapshot |
| Compute during pause | Idle (wasted) or torn down | Released; no cost during pause |
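To make the resume-time row concrete, the following back-of-the-envelope sketch multiplies the per-resume figures from the table by the number of human-input pauses in a flow. The constants are the table's estimates, not measurements, and `resume_overhead` is a hypothetical helper name.

```python
# Per-resume figures taken from the table above; treat them as estimates.
CURRENT_RESUME_SECONDS = (2 * 60, 5 * 60)  # 2-5 min per resume today
RESUMABLE_RESUME_SECONDS = 10  # upper bound: under 10s per resume


def resume_overhead(pauses: int) -> dict:
    """Estimate the total resume overhead, in seconds, for a flow with `pauses` pauses."""
    if pauses < 0:
        raise ValueError("pauses must be non-negative")
    low, high = CURRENT_RESUME_SECONDS
    return {
        "current_min_s": low * pauses,
        "current_max_s": high * pauses,
        "resumable_max_s": RESUMABLE_RESUME_SECONDS * pauses,
    }
```

Under these assumptions, a flow with three approval pauses spends roughly 6 to 15 minutes on resume overhead today, compared with under 30 seconds with resumable workloads.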
## Success Metrics
| Metric | Target |
|--------|--------|
| Workload resume time | \< 10s from `human_input` received to agent continuing execution |
| Environment continuity | Git state, file changes, and installed dependencies fully preserved across pause/resume; 0 agent rework required |
| Compute efficiency | No compute consumed during the pause window between `human_request` and `human_input` |
| HITL flow shipped | 1 end-to-end DAP flow with HITL using suspend/resume delivered by %"19.10" |
## Related Epics and Issues
- [Resumable Jobs for CI and Agent Sessions](https://gitlab.com/groups/gitlab-org/-/work_items/21159) — the underlying Runner capability this epic depends on
- [Approval Human In The Loop Node (HITL)](https://gitlab.com/groups/gitlab-org/-/work_items/20652) — the DAP flow feature this epic unblocks
- [Remote Access into Jobs for Agents & Humans](https://gitlab.com/groups/gitlab-org/-/work_items/21160) — complementary capability enabling SSH access into live Runner environments
- [Blueprint - Resumable Jobs for CI and Agent Sessions](https://gitlab.com/gitlab-org/gitlab/-/issues/593314) — architecture blueprint
- [Ensure resumed CI jobs return to the correct runner](https://gitlab.com/gitlab-org/gitlab/-/issues/596841) — runner routing enforcement
- [Modern executor fleet on .com for Resumable Jobs and AI workloads](https://gitlab.com/gitlab-org/gitlab/-/issues/597039) — .com fleet provisioning
- [Provision a runner fleet with a modern executor for the `duo` tag](https://gitlab.com/gitlab-org/gitlab/-/issues/597038) — dedicated `duo` tag fleet