Add suspend/resume support for instance autoscaler executor (gitlab-runner!6570)

Summary

Adds the foundation for suspending and resuming job environments in the instance autoscaler executor. A job that opts in can preserve its workload state (cloud instance, working tree, executor state) when it finishes; a follow-up job can then pick up where the previous one left off instead of starting from scratch.

All changes are gated behind the FF_SUSPENDABLE_ENVIRONMENTS feature flag (default off).

How it works

A job opts in via suspend_on_success and/or suspend_on_failure in its job options. When the build finishes and the matching condition is met, the runner asks the executor to preserve workload state and emits a serializable environment key identifying the suspended environment. The key has the format <runner-id>/<url-encoded-system-id>/<url-encoded-fields> and routes back to the originating runner instance.
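To make the key format concrete, here is a minimal Ruby sketch that parses and re-serializes a key of the shape described above. It is an illustration only: EnvKey and parse_env_key are hypothetical names, the runner's real type is the Go type common.EnvironmentKey, and the sketch assumes the fields segment is form-encoded (key=value pairs), which matches the sample key shown in the testing steps but is not confirmed by this description. The runner ID is kept as an opaque string.

```ruby
require "uri"

# Hypothetical value object for illustration; the runner's real type is the
# Go type common.EnvironmentKey, whose fields and encoding may differ.
EnvKey = Struct.new(:runner_id, :system_id, :fields) do
  # Serialize as <runner-id>/<url-encoded-system-id>/<url-encoded-fields>.
  def to_s
    [
      runner_id,
      URI.encode_www_form_component(system_id),
      URI.encode_www_form(fields)
    ].join("/")
  end
end

# Parse a serialized key; raises ArgumentError on malformed input.
def parse_env_key(raw)
  parts = raw.split("/", -1)
  raise ArgumentError, "expected 3 segments, got #{parts.length}" unless parts.length == 3

  runner_id, system_id, fields = parts
  raise ArgumentError, "runner id is empty" if runner_id.empty?

  EnvKey.new(
    runner_id, # kept as an opaque string
    URI.decode_www_form_component(system_id),
    URI.decode_www_form(fields).to_h # assumption: fields are key=value pairs
  )
end

key = parse_env_key("123/s_456/acquisition-key=ABC")
puts key.runner_id
puts key.system_id
puts key.fields["acquisition-key"]
puts key.to_s
```

Because the system ID and fields are URL-encoded, a literal "/" inside either segment cannot be mistaken for a separator, so splitting on "/" is enough to recover the three parts.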

A subsequent job that supplies the matching environment key in its environment_key job option resumes the same workload. The runner skips fetching sources on resume so the preserved working tree isn't clobbered.

When FF_SUSPENDABLE_ENVIRONMENTS is off, the suspend options are silently ignored and the job does not fail because of them.
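The gating rules in this section can be sketched as two small predicates. This is a hedged illustration in Ruby, not the runner's Go code: suspend_after_build? and fetch_sources? are hypothetical names, and the sketch assumes that environment_key is ignored along with the other suspend options while the feature flag is off.

```ruby
# Hypothetical decision helpers; names are illustrative, not the runner's API.

# Should the runner ask the executor to suspend once the build has finished?
def suspend_after_build?(flag_on:, job_succeeded:, suspend_on_success: false, suspend_on_failure: false)
  return false unless flag_on # options are ignored while the flag is off

  job_succeeded ? suspend_on_success : suspend_on_failure
end

# Should the runner fetch sources before running the job?
# Assumption: environment_key is also ignored while the flag is off.
def fetch_sources?(flag_on:, environment_key: nil)
  resuming = flag_on && !environment_key.to_s.empty?
  !resuming # a resumed job keeps the preserved working tree
end

puts suspend_after_build?(flag_on: true, job_succeeded: false, suspend_on_failure: true)
puts fetch_sources?(flag_on: true, environment_key: "123/s_456/acquisition-key=ABC")
```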

Scope

Foundation: feature flag, common.SuspendableExecutor interface, common.EnvironmentKey type with wire format, Build orchestration, gating accessors on *Build.
Autoscaler / Instance executor: suspend marks the taskscaler slot suspended so the cloud instance is preserved across job boundaries; resume routes back to the same slot.

Out of scope (follow-ups)

Env-key artifact upload: today the runner logs the env key but does not yet upload it to GitLab for delivery to a follow-up job. Resume from an externally-supplied env key already works.
Docker autoscaler: build-container-only preservation is a leaky abstraction (sidecars, networks, volumes lost on suspend). Full preservation needs a separate design pass.

Job options

Option	Type	Purpose
`suspend_on_success`	bool	Suspend after successful job completion
`suspend_on_failure`	bool	Suspend after job failure (e.g. for debugging)
`environment_key`	string	Resume key from a prior suspension

Failure behavior

If suspension fails on an otherwise-successful job, the job is reported back to GitLab as a system-level failure. If the job has already failed, the suspension failure is logged but does not override the original failure reason.
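The precedence between the two failure sources can be sketched as follows. This is a minimal Ruby illustration under stated assumptions: resolve_outcome is a hypothetical helper, and the :system_failure symbol is a placeholder for whatever failure reason the runner actually reports.

```ruby
# Hypothetical outcome resolution; :system_failure stands in for whatever
# failure reason the runner actually reports to GitLab.
def resolve_outcome(job_failure:, suspend_error:, log: [])
  if job_failure
    # The original failure wins; the suspension error is only logged.
    log << "suspend failed: #{suspend_error}" if suspend_error
    return job_failure
  end

  # The job itself succeeded, so a suspension error becomes the failure.
  return :system_failure if suspend_error

  :success
end

messages = []
puts resolve_outcome(job_failure: :script_failure, suspend_error: "slot lost", log: messages)
puts messages.inspect
```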

Testing

1. Build the fleeting-plugin-aws

cd fleeting-plugin-aws
make build

Add to your PATH (or copy to a directory already in PATH):

export PATH="$PWD/out:$PATH"

# Verify
fleeting-plugin-aws --version

2. Build gitlab-runner

cd gitlab-runner
go build -o gitlab-runner ./cmd/gitlab-runner

3. Run Rails on the vtak/suspended_environments branch

Check out the branch from "Add suspend/resume support to GitLab" (gitlab!230505, closed) for the Rails changes (integration testing MR).

4. Configure the runner

Create config.toml:

concurrent = 10
check_interval = 0
connection_max_age = "15m0s"
log_level = "info"
shutdown_timeout = 0

[[runners]]
  name = "vtak-local-suspendable-environments"
  url = "http://gdk.test:3000"
  id = 34
  token = "glrt-REDACTED"
  token_obtained_at = 2026-03-24T08:44:45Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "instance"
  builds_dir = "/home/ubuntu/builds"
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 2
    max_instances = 20
    instance_ready_command = "timeout 120 bash -c 'until which step-runner; do sleep 2; done'"
    reservation_throttling = false
    [[runners.autoscaler.policy]]
      preemptive_mode = false
    [runners.autoscaler.plugin_config]
      name          = "fleeting-plugin-aws-asg"
      profile       = "eng-dev-sandbox-vtak"
      region        = "ap-south-1"
      scale_in_termination = true
    [runners.autoscaler.connector_config]
      username = "ubuntu"
      key_path = "/Users/vtak/.ssh/eng-dev-sandbox-vtak-ap-south-1-key-pair.pem"
      use_external_addr = true
      dial_timeout = "2m0s"

5. Network: Autoscaler instances to GDK

If GDK runs locally on your laptop, autoscaler instances (AWS) can't resolve gdk.test. Route through a sandbox relay.

5.1. Autoscaler instances

Add to the ASG launch template user data:

echo "<sandbox-public-ip> gdk.test" >> /etc/hosts

5.2. Sandbox (one-time setup)

Enable remote port forwarding:

grep '^GatewayPorts' /etc/ssh/sshd_config || echo 'GatewayPorts yes' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart ssh

Ensure the security group allows inbound TCP 3000 from the autoscaler VPC CIDR.

5.3. Your laptop

Open the tunnel (keep this terminal open):

ssh -R 0.0.0.0:3000:gdk.test:3000 ubuntu@<sandbox-public-ip>

5.4. Verify

From the sandbox:

curl -s -o /dev/null -w '%{http_code}' http://localhost:3000
# Should return 200 or 302

6. Start the runner

./gitlab-runner run --config config.toml

7. One-time Rails console setup

Feature.disable(:duo_runner_restrictions)

8. Trigger a suspend job

user = User.find_by_username("root")
project = Project.find_by_path("gitlab-test")

suspend_d = Ci::Workloads::WorkloadDefinition.new do |d|
  d.image = "dummy"
  d.commands = [
    'echo "hello from suspended workload at $(date)" >> "${PWD}/hello.log"',
    'cat "${PWD}/hello.log"'
  ]
  d.tags = ["vtak-local-suspendable-environments"]
  d.add_variable("FF_SUSPENDABLE_ENVIRONMENTS", "true")
  d.suspend_on_success = true
  d.suspend_on_failure = true
end

suspend_workload = Ci::Workloads::RunWorkloadService.new(
  project: project, current_user: user,
  source: :duo_workflow, workload_definition: suspend_d
).execute

9. Verify build options

suspend_pipeline_id = suspend_workload.payload.pipeline_id
suspend_build = Ci::Pipeline.find(suspend_pipeline_id).builds.first

suspend_build.options
# Should include :suspend_on_success => true

presenter = Ci::BuildRunnerPresenter.new(suspend_build)
presenter.suspend_options
# Should return { suspend_on_success: true, suspend_on_failure: true }

10. Get the environment key

After the job completes, the runner suspends the environment and logs the environment key:

Job environment suspended: 123/s_456/acquisition-key=ABC

Copy the full environment key value for the next step.
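If you prefer not to copy the key by hand, a small Ruby helper can pull it out of a log line. This is a sketch that assumes the message text matches the sample line above. Real runner log lines may carry extra prefixes or trailing fields, so the helper takes only the first whitespace-free token after the marker. extract_env_key is a hypothetical name.

```ruby
# Marker text taken from the sample log line; real log lines may carry
# extra prefixes (timestamps, levels) or trailing structured fields.
SUSPEND_MARKER = "Job environment suspended: "

# Return the environment key from a log line, or nil if the line has none.
def extract_env_key(line)
  start = line.index(SUSPEND_MARKER)
  return nil unless start

  rest = line[(start + SUSPEND_MARKER.length)..]
  rest[/\A\S+/] # the key is URL-encoded, so it contains no whitespace
end

puts extract_env_key("Job environment suspended: 123/s_456/acquisition-key=ABC")
```

The returned string can be assigned to d.environment_key in the resume job definition.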

11. Trigger a resume job

Replace the placeholder environment_key value (UPDATE_ME) with the environment key copied from the runner logs:

resume_d = Ci::Workloads::WorkloadDefinition.new do |d|
  d.image = "dummy"
  d.commands = [
    'echo "${PWD}"',
    'cat "${PWD}/hello.log"'
  ]
  d.tags = ["vtak-local-suspendable-environments"]
  d.add_variable("FF_SUSPENDABLE_ENVIRONMENTS", "true")
  d.environment_key = 'UPDATE_ME'
end

resume_workload = Ci::Workloads::RunWorkloadService.new(
  project: project, current_user: user,
  source: :duo_workflow, workload_definition: resume_d
).execute

12. Verify build options

resume_pipeline_id = resume_workload.payload.pipeline_id
resume_build = Ci::Pipeline.find(resume_pipeline_id).builds.first

resume_build.options
# Should include :environment_key => "..."

presenter = Ci::BuildRunnerPresenter.new(resume_build)
presenter.suspend_options
# Should return { suspend_on_success: false, suspend_on_failure: false, environment_key: "X5eakR52n/..." }

Screenshots

NOTE

Since the runner config has capacity_per_instance set to 2, the first 2 jobs are scheduled on the same instance. The third job is scheduled on a different instance.
The resumed environment always runs on the same instance on which it was suspended. It also inherits the entire filesystem of the suspended job.
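The placement behavior in this note can be reproduced with a toy model. The Ruby sketch below is not taskscaler's implementation: SlotPool is a hypothetical class that fills instances first-fit, and it assumes that a suspended slot stays reserved (new jobs cannot take it) until a job resumes it.

```ruby
# Toy model, not taskscaler's implementation: instances are filled first-fit,
# and a suspended slot stays reserved until a job resumes it (assumption).
class SlotPool
  def initialize(capacity_per_instance)
    @capacity = capacity_per_instance
    @instances = [] # each instance is an array of :free, :busy or :suspended
  end

  # Place a new job on the first free slot, adding an instance if needed.
  def acquire
    @instances.each_with_index do |slots, i|
      s = slots.index(:free)
      next unless s

      slots[s] = :busy
      return [i, s]
    end
    @instances << Array.new(@capacity, :free)
    @instances.last[0] = :busy
    [@instances.length - 1, 0]
  end

  # Finish a job, either freeing its slot or keeping it suspended.
  def release(instance, slot, suspend: false)
    @instances[instance][slot] = suspend ? :suspended : :free
  end

  # Resume always returns the slot that was suspended.
  def resume(instance, slot)
    raise ArgumentError, "slot is not suspended" unless @instances[instance][slot] == :suspended

    @instances[instance][slot] = :busy
    [instance, slot]
  end
end

pool = SlotPool.new(2) # capacity_per_instance = 2, as in the config above
first = pool.acquire
second = pool.acquire
third = pool.acquire
puts [first, second, third].inspect
pool.release(*first, suspend: true)
puts pool.resume(*first).inspect
```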