Resumable Jobs for CI and Agent Sessions
## Problem Statement
With [Remote Access into Jobs for Agents & Humans](https://gitlab.com/groups/gitlab-org/-/epics/21160), humans and AI agents can SSH into live Runner job environments. However, jobs today are ephemeral: once a job finishes or is interrupted, the entire environment (filesystem, dependencies, build artifacts) is discarded. This causes:
1. **Idle resource waste**: When a job pauses for human review (e.g., a HIL checkpoint), the Runner sits idle consuming compute.
2. **Cold-start latency on resume**: If the job is terminated to free resources, resuming means re-provisioning, re-cloning, and rebuilding state from scratch (typically 2-5 minutes). That delay is unacceptable for interactive workflows, where a developer wants to jump into an agent session immediately.
These costs compound across workflows. An agent task with multiple HIL checkpoints, a developer iterating on a CI failure, or parallel agent sessions each multiply the cold-start penalty. Reducing environment creation time by 80-90% through fast resume would save significant compute and developer wait-time across CI jobs, AI agent sessions, and HIL handoffs alike.
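To make the compounding concrete, here is a rough back-of-the-envelope sketch. The inputs are illustrative assumptions, not measurements: a hypothetical session with 4 resumes, a 180-second cold start (inside the 2-5 minute range above), and an 85% reduction (inside the 80-90% range above). The function name `startup_overhead_seconds` is made up for this example.

```python
def startup_overhead_seconds(resumes: int, cold_start_s: float, reduction: float) -> tuple[float, float]:
    """Return (total cold-start overhead, total overhead with fast resume) in seconds."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    baseline = resumes * cold_start_s
    with_resume = baseline * (1.0 - reduction)
    return baseline, with_resume


# Assumed inputs: 4 resumes, 3-minute cold start, 85% faster environment creation.
baseline, with_resume = startup_overhead_seconds(resumes=4, cold_start_s=180, reduction=0.85)
print(f"cold start total: {baseline:.0f}s, fast resume total: {with_resume:.0f}s")
```

The overhead scales linearly with the number of resumes, so workflows with many checkpoints or many parallel sessions benefit the most.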
## Proposal
Introduce **Resumable Jobs**: snapshot a job's state, persist it, and restore it into a new job that picks up exactly where the previous one left off with near-instant startup.
Three core capabilities:
1. **Job State Snapshotting** — Capture the working volume (repo, dependencies, build outputs), job context metadata (env vars, working directory, shell history, agent session state), and an execution checkpoint marking where to resume.
2. **Fast Resume via Volume Reattachment** — Attach the persisted volume to a new Runner, inject context metadata, and begin execution from the checkpoint, skipping clone/setup entirely.
3. **Pause/Resume Lifecycle** — A **Pause → Snapshot → Terminate → Resume** lifecycle triggered by the system (idle timeout), an agent (awaiting human input), or a user. Includes configurable retention policies (TTL, storage limits) and cleanup of orphaned snapshots.
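The three capabilities above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the Runner's actual design: the class names (`Phase`, `Snapshot`, `Job`), the field names, and the `resume` helper are all hypothetical, the snapshot only models a subset of the context metadata, and the retention policy is reduced to a single TTL check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Phase(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    SNAPSHOTTED = "snapshotted"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class Snapshot:
    volume_id: str        # persisted working volume (repo, dependencies, build outputs)
    env: dict             # job context metadata: environment variables
    working_dir: str      # job context metadata: working directory
    checkpoint: str       # execution checkpoint marking where to resume
    created_at: datetime
    ttl: timedelta        # retention policy, reduced to a TTL in this sketch

    def expired(self, now: datetime) -> bool:
        return now >= self.created_at + self.ttl


@dataclass
class Job:
    job_id: str
    env: dict
    working_dir: str
    checkpoint: str = "start"
    phase: Phase = Phase.RUNNING

    def _move(self, expected: Phase, target: Phase) -> None:
        # Enforce the Pause -> Snapshot -> Terminate ordering.
        if self.phase is not expected:
            raise ValueError(f"cannot go {self.phase.name} -> {target.name}")
        self.phase = target

    def pause(self) -> None:
        self._move(Phase.RUNNING, Phase.PAUSED)

    def snapshot(self, volume_id: str, now: datetime, ttl: timedelta) -> Snapshot:
        self._move(Phase.PAUSED, Phase.SNAPSHOTTED)
        return Snapshot(volume_id, dict(self.env), self.working_dir, self.checkpoint, now, ttl)

    def terminate(self) -> None:
        self._move(Phase.SNAPSHOTTED, Phase.TERMINATED)


def resume(snapshot: Snapshot, new_job_id: str, now: datetime) -> Job:
    """Restore a snapshot into a NEW job that starts at the saved checkpoint."""
    if snapshot.expired(now):
        raise LookupError("snapshot expired under retention policy")
    return Job(new_job_id, dict(snapshot.env), snapshot.working_dir, snapshot.checkpoint)


# Usage: an agent pauses while awaiting human input, then a new job resumes later.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
job = Job("job-1", {"CI": "true"}, "/builds/app", checkpoint="await-review")
job.pause()
snap = job.snapshot("vol-1", now=t0, ttl=timedelta(hours=24))
job.terminate()
resumed = resume(snap, "job-2", now=t0 + timedelta(minutes=30))
```

Modeling resume as a function that returns a new `Job` mirrors the proposal's wording that state is restored into a new job rather than reviving the terminated one.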
The scope of this work is to enable Pause/Resume within the current architecture of GitLab CI and Runner, so that we can iterate quickly and unblock the [HITL use case for Agent Sessions](https://gitlab.com/groups/gitlab-org/-/epics/20652). See [this proposal](https://gitlab.com/groups/gitlab-org/-/epics/20652#note_3136170194) for an approach using `suspend_on_conclusion` and `environment_key` in the Workload framework.
In the future, resumable jobs should be baked into AutoFlow's design from the start. AutoFlow-specific scope will be tracked separately.
## Use Cases
- **Human-in-the-Loop (HIL)**: An agent snapshots the environment at a decision point, terminates the job, and surfaces a resume link in the MR. The developer clicks it and is dropped into the exact environment in seconds, not minutes.
- **Cost-Efficient Agent Sessions**: Agents on multi-step tasks pause during natural idle periods (waiting for CI results, review feedback) and resume without losing progress, turning always-on sessions into pay-for-what-you-use.
- **CI Debugging**: A developer snapshots a failing job at the point of failure and resumes later (or hands off to a colleague) without re-running the entire pipeline.
- **Prewarmed Environments**: Frequently used configurations (toolchain + dependencies) are snapshotted as warm baselines, dramatically reducing cold-start for new jobs.
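For the prewarmed-environment case, one open question is how a new job finds a matching warm baseline. The proposal does not specify this; the sketch below assumes, purely for illustration, that a baseline is keyed by a hash of the toolchain image name and the contents of the dependency lockfiles. The function name `baseline_key` and its inputs are hypothetical.

```python
import hashlib


def baseline_key(image: str, lockfiles: dict[str, bytes]) -> str:
    """Derive a deterministic key from a toolchain image and lockfile contents."""
    h = hashlib.sha256()
    h.update(image.encode("utf-8"))
    # Sort by file name so the key does not depend on dict insertion order.
    for name in sorted(lockfiles):
        h.update(b"\0" + name.encode("utf-8") + b"\0")
        h.update(hashlib.sha256(lockfiles[name]).digest())
    return h.hexdigest()


# Usage: two jobs with the same image and lockfiles map to the same baseline.
key = baseline_key("ruby:3.3", {"Gemfile.lock": b"rails (7.1.0)\n"})
```

Under this assumption, any change to a lockfile or to the image produces a different key, so a stale baseline is never matched to a job whose dependencies have changed.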
## Success Metrics
| Metric | Target |
|--------|--------|
| Environment resume time | \< 10 s for environment setup, independent of project size |
| AI Flow with HITL | 1 functioning AI Flow with human-in-the-loop shipped using pause/resume capability |
| Agent session cost efficiency | Agent idle periods (awaiting review, CI results) release compute; no always-on resource consumption during pauses |
| Agent session continuity | Agent resumes with full context (filesystem, env vars, session state) after pause; 0 lost progress |