Resumable Jobs for CI and Agent Sessions (#21159) · Epics · GitLab.org

## Problem Statement

With [Remote Access into Jobs for Agents & Humans](https://gitlab.com/groups/gitlab-org/-/epics/21160), humans and AI agents can SSH into live Runner job environments. However, jobs today are ephemeral: once a job finishes or is interrupted, the entire environment (filesystem, dependencies, build artifacts) is discarded. This causes:

1. **Idle resource waste**: When a job pauses for human review (e.g., a HIL checkpoint), the Runner sits idle consuming compute.
2. **Cold-start latency on resume**: If the job is terminated to free resources, resuming means re-provisioning, re-cloning, and rebuilding state from scratch (typically 2-5 minutes), which is unacceptable for interactive workflows where a developer wants to jump into an agent session immediately.

These costs compound across workflows. An agent task with multiple HIL checkpoints, a developer iterating on a CI failure, or parallel agent sessions each multiply the cold-start penalty. Reducing environment creation time by 80-90% through fast resume would save significant compute and developer wait-time across CI jobs, AI agent sessions, and HIL handoffs alike.

## Proposal

Introduce **Resumable Jobs**: snapshot a job's state, persist it, and restore it into a new job that picks up exactly where the previous one left off with near-instant startup.

Three core capabilities:

1. **Job State Snapshotting**: Capture the working volume (repo, dependencies, build outputs), job context metadata (env vars, working directory, shell history, agent session state), and an execution checkpoint marking where to resume.
2. **Fast Resume via Volume Reattachment**: Attach the persisted volume to a new Runner, inject context metadata, and begin execution from the checkpoint, skipping clone/setup entirely.
3. **Pause/Resume Lifecycle**: A **Pause → Snapshot → Terminate → Resume** lifecycle triggered by the system (idle timeout), an agent (awaiting human input), or a user.
   Includes configurable retention policies (TTL, storage limits) and cleanup of orphaned snapshots.

The scope of this epic is to enable Pause/Resume for the current architecture of GitLab CI and Runner, to iterate quickly and to unblock the [HITL use case for Agent Sessions](https://gitlab.com/groups/gitlab-org/-/epics/20652). See [this proposal](https://gitlab.com/groups/gitlab-org/-/epics/20652#note_3136170194) for an approach using `suspend_on_conclusion` and `environment_key` in the Workload framework.

In the future, resumable jobs should be baked into AutoFlow's design from the start. AutoFlow-specific scope will be tracked separately.

## Use Cases

- **Human-in-the-Loop (HIL)**: An agent snapshots the environment at a decision point, terminates the job, and surfaces a resume link in the MR. The developer clicks it and is dropped into the exact environment in seconds, not minutes.
- **Cost-Efficient Agent Sessions**: Agents on multi-step tasks pause during natural idle periods (waiting for CI results, review feedback) and resume without losing progress, turning always-on sessions into pay-for-what-you-use.
- **CI Debugging**: A developer snapshots a failing job at the point of failure and resumes later (or hands off to a colleague) without re-running the entire pipeline.
- **Prewarmed Environments**: Frequently used configurations (toolchain + dependencies) are snapshotted as warm baselines, dramatically reducing cold-start for new jobs.

## Success Metrics

| Metric | Target |
|--------|--------|
| Environment resume time | < 10s environment setup independent of project size |
| AI Flow with HITL | 1 functioning AI Flow with human-in-the-loop shipped using pause/resume capability |
| Agent session cost efficiency | Agent idle periods (awaiting review, CI results) release compute; no always-on resource consumption during pauses |
| Agent session continuity | Agent resumes with full context (filesystem, env vars, session state) after pause; 0 lost progress |
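
To make the Pause → Snapshot → Terminate → Resume lifecycle and the TTL-based retention from the Proposal more concrete, here is a minimal Python sketch of how the states and a snapshot record might be modeled. It is an illustration under stated assumptions, not the planned implementation: every name in it (`JobState`, `Snapshot`, `ResumableJob`, `expired_snapshots`, the trigger strings, the field names) is hypothetical and does not correspond to any existing GitLab or Runner API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    SNAPSHOTTED = "snapshotted"
    TERMINATED = "terminated"


# Legal moves for Pause -> Snapshot -> Terminate -> Resume (hypothetical model).
TRANSITIONS = {
    JobState.RUNNING: {JobState.PAUSED},
    JobState.PAUSED: {JobState.SNAPSHOTTED},
    JobState.SNAPSHOTTED: {JobState.TERMINATED},
    JobState.TERMINATED: {JobState.RUNNING},  # resume on a new Runner
}

# Who may request a pause, per the proposal: system, agent, or user.
PAUSE_TRIGGERS = {"idle_timeout", "agent", "user"}


@dataclass
class Snapshot:
    """Hypothetical record of what the proposal says a snapshot captures."""
    volume_id: str            # persisted working volume
    env_vars: Dict[str, str]  # job context metadata
    working_directory: str
    checkpoint: str           # where execution should pick up again
    created_at: datetime
    ttl: timedelta            # retention policy

    def is_expired(self, now: datetime) -> bool:
        return now >= self.created_at + self.ttl


@dataclass
class ResumableJob:
    job_id: str
    state: JobState = JobState.RUNNING
    snapshot: Optional[Snapshot] = None

    def _move(self, target: JobState) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def pause(self, trigger: str) -> None:
        if trigger not in PAUSE_TRIGGERS:
            raise ValueError(f"unknown pause trigger: {trigger}")
        self._move(JobState.PAUSED)

    def take_snapshot(self, snapshot: Snapshot) -> None:
        self._move(JobState.SNAPSHOTTED)
        self.snapshot = snapshot

    def terminate(self) -> None:
        self._move(JobState.TERMINATED)

    def resume(self, now: datetime) -> str:
        """Return the checkpoint to continue from, refusing expired snapshots."""
        if self.snapshot is None or self.snapshot.is_expired(now):
            raise RuntimeError("no usable snapshot; a cold start is required")
        self._move(JobState.RUNNING)
        return self.snapshot.checkpoint


def expired_snapshots(snapshots: List[Snapshot], now: datetime) -> List[Snapshot]:
    """Select snapshots past their TTL so a cleanup task can delete them."""
    return [s for s in snapshots if s.is_expired(now)]


# Usage: an agent pauses at a decision point, and the job is resumed an hour later.
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
job = ResumableJob(job_id="job-1")
job.pause(trigger="agent")
job.take_snapshot(
    Snapshot("vol-1", {"CI": "true"}, "/builds/project", "step-3", start, timedelta(hours=24))
)
job.terminate()
checkpoint = job.resume(now=start + timedelta(hours=1))
```

Keeping the allowed transitions in one table makes it hard to terminate a job before its snapshot exists, which is one way to avoid losing progress. Refusing to resume from an expired snapshot falls back to the cold-start path described in the Problem Statement.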
