Add suspend/resume support for instance autoscaler executor
Summary
Adds the foundation for suspending and resuming job environments on the instance autoscaler executor. A job that opts in can preserve its workload state (cloud instance, working tree, executor state) when it finishes; a follow-up job can pick up where the previous one left off instead of starting from scratch.
All changes are gated behind the FF_SUSPENDABLE_ENVIRONMENTS feature flag (default off).
How it works
A job opts in via suspend_on_success and/or suspend_on_failure in its job options. When the build
finishes and the matching condition is met, the runner asks the executor to preserve workload state
and emits a serializable environment key identifying the suspended environment. The key has the
format <runner-id>/<url-encoded-system-id>/<url-encoded-fields> and routes back to the originating
runner instance.
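As a rough illustration of the key format described above, here is a minimal Ruby sketch (Ruby only to match the console snippets in this description; the runner itself is written in Go). The `EnvKey` struct and the `format_env_key` / `parse_env_key` helpers are hypothetical names, and treating the fields segment as form-encoded `key=value` pairs is an assumption based on the sample key `123/s_456/acquisition-key=ABC`; the runner's actual encoding may differ.

```ruby
require "uri"

# Hypothetical in-memory shape of an environment key (not the runner's Go type).
EnvKey = Struct.new(:runner_id, :system_id, :fields)

# Build "<runner-id>/<url-encoded-system-id>/<url-encoded-fields>".
# Assumption: fields are serialized as form-encoded key=value pairs.
def format_env_key(key)
  [
    key.runner_id,
    URI.encode_www_form_component(key.system_id),
    URI.encode_www_form(key.fields)
  ].join("/")
end

# Split on "/" and decode each segment. Any "/" inside a segment is
# percent-encoded by format_env_key, so exactly three parts are expected.
def parse_env_key(raw)
  parts = raw.split("/", -1)
  raise ArgumentError, "expected 3 segments, got #{parts.length}" unless parts.length == 3

  runner_id, system_id, fields = parts
  EnvKey.new(
    runner_id,
    URI.decode_www_form_component(system_id),
    URI.decode_www_form(fields).to_h
  )
end
```

Percent-encoding the system ID and fields keeps `/` unambiguous as the segment separator, which is what lets the first segment route the key back to the originating runner.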
A subsequent job that supplies the matching environment key in its environment_key job option
resumes the same workload. The runner skips fetching sources on resume so the preserved working tree
isn't clobbered.
When FF_SUSPENDABLE_ENVIRONMENTS is off, the suspend options are silently ignored and the job does not fail because of them.
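The gating rules in this section can be sketched as two small predicates. This is an illustrative Ruby sketch, not the runner's Go code: `suspend_after_job?` and `skip_source_fetch?` are hypothetical names, and the assumption that resume is gated by the same feature flag follows from the statement that all changes sit behind FF_SUSPENDABLE_ENVIRONMENTS.

```ruby
# Hypothetical helper: should the runner ask the executor to suspend?
# With the feature flag off, the suspend options are ignored entirely.
def suspend_after_job?(feature_enabled:, options:, job_succeeded:)
  return false unless feature_enabled

  option = job_succeeded ? :suspend_on_success : :suspend_on_failure
  options[option] == true
end

# Hypothetical helper: on resume (an environment_key was supplied), skip
# fetching sources so the preserved working tree is left untouched.
# Assumption: resume is gated by the same feature flag.
def skip_source_fetch?(feature_enabled:, options:)
  feature_enabled && !options[:environment_key].to_s.empty?
end
```

For example, a job with only `suspend_on_failure: true` that succeeds is not suspended, and any job is left alone when the flag is off.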
Scope
- Foundation: feature flag, common.SuspendableExecutor interface, common.EnvironmentKey type with wire format, Build orchestration, gating accessors on *Build.
- Autoscaler / Instance executor: suspend marks the taskscaler slot suspended so the cloud instance is preserved across job boundaries; resume routes back to the same slot.
Out of scope (follow-ups)
- Env-key artifact upload: today the runner logs the env key but does not yet upload it to GitLab for delivery to a follow-up job. Resume from an externally-supplied env key already works.
- Docker autoscaler: build-container-only preservation is a leaky abstraction (sidecars, networks, volumes lost on suspend). Full preservation needs a separate design pass.
Job options
| Option | Type | Purpose |
|---|---|---|
| suspend_on_success | bool | Suspend after successful job completion |
| suspend_on_failure | bool | Suspend after job failure (e.g. for debugging) |
| environment_key | string | Resume key from a prior suspension |
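To make the defaults concrete, the following Ruby sketch normalizes a raw options hash into the shape that the verification steps in the Testing section expect back from `presenter.suspend_options`. The helper name `normalize_suspend_options` is hypothetical, and the defaulting rules (missing booleans become `false`, `environment_key` is included only when set) are inferred from those expected outputs rather than taken from the Rails code.

```ruby
# Hypothetical normalizer for the three job options in the table above.
def normalize_suspend_options(options)
  normalized = {
    suspend_on_success: options[:suspend_on_success] == true,
    suspend_on_failure: options[:suspend_on_failure] == true
  }

  # Assumption: the resume key is only reported when one was supplied.
  key = options[:environment_key]
  normalized[:environment_key] = key if key.is_a?(String) && !key.empty?
  normalized
end
```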
Failure behavior
If suspension fails on an otherwise-successful job, the job is reported back to GitLab as a system-level failure. If the job has already failed, the suspension failure is logged but does not override the original failure reason.
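The precedence between the two failure sources can be sketched as follows. This is an illustrative Ruby sketch with hypothetical names (`report_outcome`, the `:system_failure` symbol, and the `log` array stand in for whatever the runner actually uses).

```ruby
# Hypothetical outcome resolver: the original job failure always wins;
# a suspension failure only surfaces when the job itself succeeded.
def report_outcome(job_error:, suspend_error:, log: [])
  if job_error
    # Suspension failure is logged but does not replace the job's own reason.
    log << "suspend failed: #{suspend_error}" if suspend_error
    return { status: :failed, reason: job_error }
  end

  # Job succeeded but the environment could not be preserved.
  return { status: :failed, reason: :system_failure } if suspend_error

  { status: :success, reason: nil }
end
```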
Testing
1. Build the fleeting-plugin-aws
cd fleeting-plugin-aws
make build
Add to your PATH (or copy to a directory already in PATH):
export PATH="$PWD/out:$PATH"
# Verify
fleeting-plugin-aws --version
2. Build gitlab-runner
cd gitlab-runner
go build -o gitlab-runner ./cmd/gitlab-runner
3. Run Rails on vtak/suspended_environments branch
Check out "Add suspend/resume support to GitLab" (gitlab!230505 - closed) for the Rails changes (integration testing MR).
4. Configure the runner
Create config.toml:
concurrent = 10
check_interval = 0
connection_max_age = "15m0s"
log_level = "info"
shutdown_timeout = 0
[[runners]]
name = "vtak-local-suspendable-environments"
url = "http://gdk.test:3000"
id = 34
token = "glrt-REDACTED"
token_obtained_at = 2026-03-24T08:44:45Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "instance"
builds_dir = "/home/ubuntu/builds"
[runners.cache]
MaxUploadedArchiveSize = 0
[runners.autoscaler]
plugin = "fleeting-plugin-aws"
capacity_per_instance = 2
max_instances = 20
instance_ready_command = "timeout 120 bash -c 'until which step-runner; do sleep 2; done'"
reservation_throttling = false
[[runners.autoscaler.policy]]
preemptive_mode = false
[runners.autoscaler.plugin_config]
name = "fleeting-plugin-aws-asg"
profile = "eng-dev-sandbox-vtak"
region = "ap-south-1"
scale_in_termination = true
[runners.autoscaler.connector_config]
username = "ubuntu"
key_path = "/Users/vtak/.ssh/eng-dev-sandbox-vtak-ap-south-1-key-pair.pem"
use_external_addr = true
dial_timeout = "2m0s"
5. Network: Autoscaler instances to GDK
If GDK runs locally on your laptop, autoscaler instances (AWS) can't resolve
gdk.test. Route through a sandbox relay.
5.1. Autoscaler instances
Add to the ASG launch template user data:
echo "<sandbox-public-ip> gdk.test" >> /etc/hosts
5.2. Sandbox (one-time setup)
Enable remote port forwarding:
grep '^GatewayPorts' /etc/ssh/sshd_config || echo 'GatewayPorts yes' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart ssh
Ensure the security group allows inbound TCP 3000 from the autoscaler VPC CIDR.
5.3. Your laptop
Open the tunnel (keep this terminal open):
ssh -R 0.0.0.0:3000:gdk.test:3000 ubuntu@<sandbox-public-ip>
5.4. Verify
From the sandbox:
curl -s -o /dev/null -w '%{http_code}' http://localhost:3000
# Should return 200 or 302
6. Start the runner
./gitlab-runner run --config config.toml
7. One-time Rails console setup
Feature.disable(:duo_runner_restrictions)
8. Trigger a suspend job
user = User.find_by_username("root")
project = Project.find_by_path("gitlab-test")
suspend_d = Ci::Workloads::WorkloadDefinition.new do |d|
  d.image = "dummy"
  d.commands = [
    'echo "hello from suspended workload at $(date)" >> "${PWD}/hello.log"',
    'cat "${PWD}/hello.log"'
  ]
  d.tags = ["vtak-local-suspendable-environments"]
  d.add_variable("FF_SUSPENDABLE_ENVIRONMENTS", "true")
  d.suspend_on_success = true
  d.suspend_on_failure = true
end
suspend_workload = Ci::Workloads::RunWorkloadService.new(
  project: project, current_user: user,
  source: :duo_workflow, workload_definition: suspend_d
).execute
9. Verify build options
suspend_pipeline_id = suspend_workload.payload.pipeline_id
suspend_build = Ci::Pipeline.find(suspend_pipeline_id).builds.first
suspend_build.options
# Should include :suspend_on_success => true
presenter = Ci::BuildRunnerPresenter.new(suspend_build)
presenter.suspend_options
# Should return { suspend_on_success: true, suspend_on_failure: true }
10. Get the environment key
After the job completes, the runner suspends the environment and logs the environment key:
Job environment suspended: 123/s_456/acquisition-key=ABC
Copy the full environment key value for the next step.
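If copying the key by hand is error-prone, a small helper can pull it out of a captured log line. This is a minimal Ruby sketch that assumes the log message looks like the sample above (`Job environment suspended: ` followed by the key with no whitespace); `extract_env_key` is a hypothetical name, and real runner log lines may carry extra fields or different formatting.

```ruby
# Hypothetical helper: return the environment key from a runner log line,
# or nil when the line is not a suspension message.
def extract_env_key(log_line)
  match = log_line.match(/Job environment suspended: (\S+)/)
  match && match[1]
end
```

The returned string can then be assigned to `d.environment_key` in the resume step.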
11. Trigger a resume job
Replace the environment_key value (UPDATE_ME) with the environment key from the runner logs:
resume_d = Ci::Workloads::WorkloadDefinition.new do |d|
  d.image = "dummy"
  d.commands = [
    'echo "${PWD}"',
    'cat "${PWD}/hello.log"'
  ]
  d.tags = ["vtak-local-suspendable-environments"]
  d.add_variable("FF_SUSPENDABLE_ENVIRONMENTS", "true")
  d.environment_key = 'UPDATE_ME'
end
resume_workload = Ci::Workloads::RunWorkloadService.new(
  project: project, current_user: user,
  source: :duo_workflow, workload_definition: resume_d
).execute
12. Verify build options
resume_pipeline_id = resume_workload.payload.pipeline_id
resume_build = Ci::Pipeline.find(resume_pipeline_id).builds.first
resume_build.options
# Should include :environment_key => "..."
presenter = Ci::BuildRunnerPresenter.new(resume_build)
presenter.suspend_options
# Should return { suspend_on_success: false, suspend_on_failure: false, environment_key: "X5eakR52n/..." }
Screenshots
NOTE
- Since the runner config has capacity_per_instance set to 2, the first 2 jobs are scheduled on the same instance. The third job is scheduled on a different instance.
- The resumed environment always runs on the same instance on which it was suspended. It also inherits the entire filesystem of the suspended job.





