Capstone issue: Introduce new API for Runner to transition jobs to 'running' explicitly


Problems

Problem 1

A job, once picked up by GitLab-Runner, is moved to the "running" state immediately, even if GitLab-Runner isn't yet executing the job but is still doing preparation work.

  • Depending on the executor and configuration, finding capacity is done ad-hoc, meaning that even though we're in the running state, we're actually looking for/provisioning an environment to execute the job in.
  • Even for executors that are configured to first have an environment ready before requesting a job there can still be some prep time required.
  • This prep time is misleading and, in some setups, can be counted towards compute minutes.

Problem 2

Jobs are assigned to runners that go offline before starting work on the job, so the jobs stay pending until they time out. Reports of this have been increasing over time, including a severity 2 bug and a recent incident.

Smart routing

There have been previous discussions about a Runner daemon/router, but those proposals relied on significant changes to GitLab.

A side effect of fixing the problems above is that we can seamlessly introduce a smart Runner router/daemon for distributing jobs more efficiently. The router would use the exact same API as GitLab-Runner does.

When a job is picked up by the router, the state transitions to waiting_for_runner_ack. At this point, finding a Runner to execute the job is delegated to the router. When a Runner is found and executes the job, it instructs GitLab to transition the job to running.

We go from this:

       ┌────► GitLab-Runner 1

GitLab ├────► GitLab-Runner 2

       └────► GitLab-Runner 3

To this:

                         ┌────► GitLab-Runner 1

GitLab ───► Smart router ├────► GitLab-Runner 2

                         └────► GitLab-Runner 3

This doesn't solve the whole spectrum of problems we have with the existing job queue mechanism, but helps for certain setups and customers.

Distributing jobs from GitLab directly to Runners at scale is complicated.

Distributing to routers instead offloads some of that responsibility, and a router, with its own queue of jobs, can be smarter about how it hands them out.

Some scenarios it can help with (a rough sketch of the router loop follows this list):

  • Rather than multiple runner managers asking the GitLab instance for jobs, only the daemon needs to; all Runners ask the daemon for jobs instead. This can help with a large fleet of self-hosted Runners hitting GitLab.com and can tighten network permissions.
  • Let customers implement whatever they consider "fair scheduling" across their own fleet of Runners
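
A very rough sketch of the router loop under these assumptions: the router polls the existing POST /api/v4/jobs/request endpoint with a runner token, exactly as GitLab-Runner would, and places any job it receives on its own internal queue for local runners to pull from. Only the endpoint and its 201/204 response semantics come from the existing API; the names, the channel-based queue, and the hand-off are illustrative.

// Very rough sketch of the smart router idea: one poller talks to GitLab on
// behalf of a whole fleet and hands jobs to local runners via its own queue.
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "time"
)

// requestJob performs the same POST /api/v4/jobs/request call a runner would make.
// 201 Created carries a job payload; anything else (204 No Content) means no job.
func requestJob(baseURL, runnerToken string) ([]byte, bool, error) {
    body := bytes.NewBufferString(fmt.Sprintf(`{"token":%q}`, runnerToken))
    resp, err := http.Post(baseURL+"/api/v4/jobs/request", "application/json", body)
    if err != nil {
        return nil, false, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusCreated {
        return nil, false, nil
    }
    payload, err := io.ReadAll(resp.Body)
    return payload, err == nil, err
}

func main() {
    jobs := make(chan []byte, 100) // the router's own queue of assigned jobs

    // Router loop: a single poller requests jobs from GitLab.
    go func() {
        for {
            if payload, ok, err := requestJob("https://gitlab.example.com", "router-runner-token"); err == nil && ok {
                jobs <- payload
            }
            time.Sleep(3 * time.Second)
        }
    }()

    // Local runners ask the router (not GitLab) for work; once one of them is
    // actually ready to execute, it acknowledges the job via the new API so the
    // job only then transitions to running.
    for payload := range jobs {
        fmt.Printf("handing %d bytes of job payload to a local runner\n", len(payload))
    }
}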

Proposal

Introduce a new job state waiting_for_runner_ack and a two-phase commit workflow for job assignment that addresses both problems.

Two-phase commit workflow

When a runner with two-phase commit support requests a job (a runner-side sketch follows this list):

  1. Phase 1 - Job Assignment: The job is assigned to the runner and transitioned to waiting_for_runner_ack state

    • The job is removed from ci_pending_builds to prevent assignment to other runners
    • The job remains in this state while the runner performs preparation tasks (provisioning, environment setup, etc.)
    • The runner can send keep-alive signals via PUT /api/v4/jobs/:id with state=pending to prevent timeout
  2. Phase 2 - Job Acceptance: The runner signals readiness to execute the job

    • Runner calls PUT /api/v4/jobs/:id with state=running to transition the job to running state
    • At this point, the job timer starts and actual execution begins
    • The response must include updated job metadata (started_at timestamp, refreshed job token, etc.)
    • If the runner cannot execute the job, it can decline by calling the same endpoint with appropriate failure state
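
A minimal runner-side sketch of this flow, assuming a plain HTTP client. The endpoint path and the state and include_metadata parameters are the ones proposed here; the helper name, passing the job token as a request parameter, and the timings are illustrative assumptions, not the Runner's actual implementation.

// Minimal runner-side sketch of the two-phase workflow.
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "time"
)

// updateJobState calls PUT /api/v4/jobs/:id with the given state.
// state=pending acts as a keep-alive while preparation is still in progress;
// state=running asks GitLab to move the job from waiting_for_runner_ack to running.
func updateJobState(baseURL string, jobID int, jobToken, state string, includeMetadata bool) (*http.Response, error) {
    params := url.Values{"token": {jobToken}, "state": {state}}
    if includeMetadata {
        params.Set("include_metadata", "true")
    }
    endpoint := fmt.Sprintf("%s/api/v4/jobs/%d?%s", baseURL, jobID, params.Encode())
    req, err := http.NewRequest(http.MethodPut, endpoint, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}

func main() {
    base, jobID, token := "https://gitlab.example.com", 1234, "job-token"

    // Phase 1 already happened on the GitLab side: the job was assigned to this
    // runner and sits in waiting_for_runner_ack. While the execution environment
    // is being prepared, send periodic keep-alives so the job is not timed out.
    prepDone := make(chan struct{})
    go func() {
        time.Sleep(30 * time.Second) // stand-in for provisioning / environment setup
        close(prepDone)
    }()

    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
keepAlive:
    for {
        select {
        case <-prepDone:
            break keepAlive
        case <-ticker.C:
            if _, err := updateJobState(base, jobID, token, "pending", false); err != nil {
                fmt.Println("keep-alive failed:", err)
            }
        }
    }

    // Phase 2: the environment is ready. Ask GitLab to start the job timer and
    // return refreshed metadata (started_at, refreshed job token, and so on).
    resp, err := updateJobState(base, jobID, token, "running", true)
    if err != nil {
        fmt.Println("acceptance failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("acceptance status:", resp.Status)
}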

State machine changes

  • Add new waiting_for_runner_ack state to the job state machine
  • State transitions:
    • pending → waiting_for_runner_ack: When runner requests job (job negotiation enabled)
    • pending → running: When runner requests job (legacy workflow, unchanged)
    • waiting_for_runner_ack → running: When runner confirms job acceptance
    • waiting_for_runner_ack → failed: When runner declines job or timeout occurs

Runner feature detection

Runners declare support for job negotiation via the supports_job_negotiation feature flag in their capabilities:

{
  "info": {
    "features": {
      "supports_job_negotiation": true
    }
  }
}
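
A small sketch of how a runner could declare this capability in its job request payload; the struct names are illustrative, while the token field and the info.features object mirror the shape shown above for POST /api/v4/jobs/request.

// Sketch of a job request payload declaring the new capability.
package main

import (
    "encoding/json"
    "fmt"
)

type features struct {
    SupportsJobNegotiation bool `json:"supports_job_negotiation"`
}

type runnerInfo struct {
    Features features `json:"features"`
}

type jobRequest struct {
    Token string     `json:"token"`
    Info  runnerInfo `json:"info"`
}

func main() {
    body, _ := json.MarshalIndent(jobRequest{
        Token: "runner-token",
        Info:  runnerInfo{Features: features{SupportsJobNegotiation: true}},
    }, "", "  ")
    // This body would be sent with POST /api/v4/jobs/request; GitLab uses the
    // flag to decide between the legacy and two-phase assignment paths.
    fmt.Println(string(body))
}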

Timeout handling

  • A background worker (Ci::RetryStuckWaitingJobWorker) monitors jobs in waiting_for_runner_ack state
  • Jobs that don't receive acceptance within the timeout period are automatically retried or failed
  • This prevents jobs from being stuck indefinitely if a runner goes offline during preparation

API response structure for job acceptance

When transitioning from waiting_for_runner_ack to running, the runner needs updated job metadata that is calculated at the moment of transition:

  • started_at timestamp (for CI_JOB_STARTED_AT environment variable)
  • Refreshed job token with correct expiration time (when using JWT)
  • Potentially other runtime-calculated values

Implementation approach: Use the include_metadata request parameter to opt-in to receiving updated metadata in the response body from PUT /api/v4/jobs/:id. While started_at and token could be returned in response headers, a structured response is more extensible for future additions.

Example request:

PUT /api/v4/jobs/:id?state=running&include_metadata=true

Example response:

{
  "id": 1234,
  "token": "<refreshed-job-token>",
  "allow_git_fetch": true,
  "job_info": {
    "id": 1234,
    "name": "test-job",
    "stage": "test",
    "project_id": 5678,
    "project_name": "my-project"
  },
  "git_info": {
    "repo_url": "https://gitlab.example.com/my-group/my-project.git",
    "ref": "main",
    "sha": "a1b2c3d4e5f6",
    "before_sha": "0000000000000000000000000000000000000000",
    "ref_type": "branch"
  },
  "runner_info": {
    "timeout": 3600,
    "runner_session_url": "https://gitlab.example.com/session"
  },
  "variables": [
    {
      "key": "CI_JOB_ID",
      "value": "1234",
      "public": true,
      "masked": false
    },
    {
      "key": "CI_JOB_STARTED_AT",
      "value": "2025-12-03T19:07:34Z",
      "public": true,
      "masked": false
    },
    {
      "key": "CI_COMMIT_SHA",
      "value": "a1b2c3d4e5f6",
      "public": true,
      "masked": false
    }
  ],
  "steps": [
    {
      "name": "script",
      "script": ["echo 'Running tests'", "npm test"],
      "timeout": 3600,
      "when": "on_success",
      "allow_failure": false
    }
  ],
  "image": {
    "name": "node:18",
    "entrypoint": null
  },
  "services": [],
  "artifacts": [],
  "cache": [],
  "credentials": [],
  "dependencies": [],
  "features": {
    "trace_sections": true
  }
}
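
Continuing the example above, a sketch of how a runner might apply the runtime values from the acceptance response. Only fields present in the example response are decoded; the struct names and the os.Setenv handling are illustrative assumptions about how a runner could propagate the refreshed values.

// Sketch of consuming the acceptance response shown above.
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type jobVariable struct {
    Key    string `json:"key"`
    Value  string `json:"value"`
    Public bool   `json:"public"`
    Masked bool   `json:"masked"`
}

type acceptanceResponse struct {
    ID        int           `json:"id"`
    Token     string        `json:"token"`
    Variables []jobVariable `json:"variables"`
}

func applyAcceptanceMetadata(body []byte) error {
    var resp acceptanceResponse
    if err := json.Unmarshal(body, &resp); err != nil {
        return err
    }
    // The refreshed job token replaces the one issued at assignment time; its
    // expiry is now relative to the real start of execution.
    fmt.Println("refreshed token received for job", resp.ID)

    // Re-export variables recalculated at the moment of transition, e.g.
    // CI_JOB_STARTED_AT, so the build environment sees the real start time.
    for _, v := range resp.Variables {
        if v.Key == "CI_JOB_STARTED_AT" {
            if err := os.Setenv(v.Key, v.Value); err != nil {
                return err
            }
        }
    }
    return nil
}

func main() {
    // In practice this body comes from PUT /api/v4/jobs/:id?state=running&include_metadata=true.
    body := []byte(`{"id":1234,"token":"refreshed","variables":[{"key":"CI_JOB_STARTED_AT","value":"2025-12-03T19:07:34Z","public":true,"masked":false}]}`)
    if err := applyAcceptanceMetadata(body); err != nil {
        fmt.Println("error:", err)
    }
}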

Benefits

  1. Accurate job timing: Job duration only includes actual execution time, not preparation time
  2. Compute minute accuracy: Preparation time is not counted toward compute minutes
  3. Improved reliability: Jobs can be reassigned if runners go offline during preparation
  4. Backward compatibility: Legacy runners continue to work unchanged
  5. Foundation for smart routing: Enables future router/daemon implementations for more efficient job distribution

Implementation approach

  • Use database state (waiting_for_runner_ack) instead of Redis for tracking waiting jobs
  • Leverage existing state machine infrastructure for state transitions
  • Enhance existing API endpoint (PUT /api/v4/jobs/:id) with metadata response support
  • Recalculate runtime values (started_at, token) when transitioning to running
  • Minimal database schema changes (only adding the new state enum value)

Known edge cases and considerations

  1. Job token lifecycle: Token is assigned during state machine transition to running and can only be used while job is running
  2. Multiple trace artifacts: If a job can be put back on the queue while traces are accepted immediately, we need to handle multiple trace artifacts, or drop the job and retry it
  3. Runner predefined variables: Need mechanism to send these variables when job transitions to running
  4. Job timeouts: Currently assigned from runner; may need to be assigned later when job transitions to running
  5. JWT token expiry: Timeout defines expiry time for JWT token sent with payload; may need new variables at transition point
  6. Stuck jobs worker: Will need updates to handle new state
  7. Job metrics timestamps: started_at should reflect actual execution start time, not assignment time
  8. Runner-job relationship: p_ci_runner_machine_builds table tracks relationship between jobs and runners

Previous proposal (Redis-based approach)

A solution to this would be to introduce a new Runner feature in which a job, once picked up by a Runner, would remain in the pending state.

Only when GitLab-Runner has finished preparation tasks and is ready to actually execute the job would it notify GitLab to transition the job to the running state.

With the new feature enabled:

  • when build.runner_id == nil the job is pending (as it is today)
  • when build.runner_id != nil the job remains pending, but no other runner will be assigned (and trace data can be submitted).
  • Runner would then call a new API endpoint to transition the job state to running.

See #464048 (comment 1942871742) for the anticipated database changes. (UPDATE: this should be doable without database changes, by temporarily removing the entry from ci_pending_builds while GitLab waits for the runner to accept the job.)

NOTE: Any changes to the flow should be reflected in the documentation, including (but not limited to):

Implementation plan

Current workflow (legacy runner or feature not enabled)

  1. Pending job is added to ci_pending_builds (Ci::UpdateBuildQueueService#push).
  2. Runners poll for jobs (POST /api/v4/jobs/request → Ci::RegisterJobService#execute).
  3. Once the pending job is assigned to a specific runner, Ci::Build.run! is called which causes ci_builds.status field to be updated from :pending to :running and the state machine to call Ci::UpdateBuildQueueService#pop to remove the job from ci_pending_builds.

New workflow (runner with supports_job_negotiation feature enabled)

  1. Pending job is added to ci_pending_builds (Ci::UpdateBuildQueueService#push).
  2. Runners poll for jobs (POST /api/v4/jobs/request → Ci::RegisterJobService#execute).
  3. Once the pending job is assigned to a runner with job negotiation support:
    • The job is transitioned to waiting_for_runner_ack state via Ci::Build.acknowledge_runner!
    • The runner manager is associated with the build
    • The state machine automatically calls Ci::UpdateBuildQueueService#pop to remove the job from ci_pending_builds
    • A background job (Ci::RetryStuckWaitingJobWorker) is scheduled to handle timeout scenarios
  4. The runner can send keep-alive signals using the existing PUT /api/v4/jobs/:id endpoint with state=pending to prevent timeout during preparation.
  5. When the runner is ready to execute the job, it calls PUT /api/v4/jobs/:id?state=running&include_metadata=true:
    • The job transitions from waiting_for_runner_ack to running via Ci::Build.run!
    • Runtime values are calculated: started_at timestamp, refreshed job token (if JWT), etc.
    • A structured response is returned containing the updated metadata
    • The runner updates its environment variables with the new values (e.g., CI_JOB_STARTED_AT)
    • Job execution begins
  6. If the runner cannot execute the job or times out:
    • The job can be failed or retried via Ci::RetryStuckWaitingJobWorker
    • The job can be reassigned to another runner if available

API changes

The existing PUT /api/v4/jobs/:id endpoint is enhanced with metadata response support:

  • state=pending: Keep-alive signal during preparation (returns 200 OK)
  • state=running: Transition from waiting_for_runner_ack to running
    • Without include_metadata=true: Legacy behavior (returns 200 OK)
    • With include_metadata=true: Returns full job response with updated metadata (see example above)
  • Other states: Regular job completion handling (success, failed, etc.)

Database changes

  • Add waiting_for_runner_ack to the status enum in ci_builds table
  • No additional tables or columns required
  • State transitions are handled by the existing state machine

Monitoring

New Prometheus metrics track the two-phase commit workflow:

  • gitlab_ci_queue_operations_total{operation="runner_assigned_waiting_for_ack"}: Jobs assigned to waiting_for_runner_ack state
  • gitlab_ci_queue_operations_total{operation="runner_assigned_run"}: Jobs assigned directly to running state (legacy)
  • gitlab_ci_queue_operations_total{operation="runner_queue_timeout"}: Jobs that timed out while waiting for acknowledgment

Previous implementation plan (Redis-based approach)

New workflow (new runner with the feature flag enabled in info.features)

  1. Pending job is added to ci_pending_builds (Ci::UpdateBuildQueueService#push).

  2. Runners poll for jobs (POST /jobs/request -> Ci::RegisterJobService#execute).

  3. Once the pending job is assigned to a specific runner, Ci::RegisterJobService should call Ci::UpdateBuildQueueService#pop to pop it from ci_pending_builds, so other runners are not assigned this job.

  4. A new REST API endpoint (e.g. POST /api/v4/jobs/{job_id}/runner_provisioning) will allow the runner to inform the GitLab instance about 3 scenarios:

    • pending: Runner manager is still preparing the runner. Handled by: Rails app (could be Workhorse). Result: this call will occur regularly as a keep-alive to the GitLab instance. If the GitLab instance doesn't hear from the runner for, say, 5 minutes, it can return the job to the queue and assign it to another runner, as in the declined scenario. Workhorse could handle this entirely if needed by simply setting a Redis key (build:pending_runner_queue:#{build_id}), so we keep a list of pending jobs.
    • accepted: The job has been accepted and the runner has started execution. Handled by: Rails app. Result: transition the job to running by calling Ci::Build.run!. The state machine will call Ci::UpdateBuildQueueService#pop to pop the job from ci_pending_builds; since the job is no longer present there, this is just a no-op.
    • declined: The job has been declined by the runner manager. Handled by: Rails app. Result: push the job again to ci_pending_builds by calling Ci::UpdateBuildQueueService#push. We'll need to clear runner_id, runner_manager_id, and runner_session_attributes from the ci_builds record. It may be better to cancel the job and retry it. We may need a mechanism to avoid defaulting to the same runner when choosing a fallback runner for a declined job, if other runners are available.

  5. A cronjob worker (perhaps StuckCiJobsWorker?) iterates over the Redis keys representing pending jobs and declines any that have timed out without further word from the runner (any keys matching build:pending_runner_queue:* that are too old). A sketch of this bookkeeping follows the list.
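
A small sketch of the Redis bookkeeping this earlier approach relied on, using the build:pending_runner_queue:#{build_id} key pattern from the table above; the go-redis client, the timestamp encoding, and the decline handling are assumptions for illustration.

// Sketch of the Redis bookkeeping from the previous proposal: every pending
// keep-alive refreshes build:pending_runner_queue:{job_id}, and a periodic
// worker declines jobs whose key is older than the allowed window.
package main

import (
    "context"
    "fmt"
    "strings"
    "time"

    "github.com/redis/go-redis/v9"
)

const staleAfter = 5 * time.Minute

// touchPendingJob is what the keep-alive call would do on every invocation.
func touchPendingJob(ctx context.Context, rdb *redis.Client, jobID int64) error {
    key := fmt.Sprintf("build:pending_runner_queue:%d", jobID)
    return rdb.Set(ctx, key, time.Now().Unix(), 0).Err()
}

// declineStaleJobs is what the cron worker would do: scan the pending keys and
// decline anything that has not been refreshed recently enough.
func declineStaleJobs(ctx context.Context, rdb *redis.Client) error {
    iter := rdb.Scan(ctx, 0, "build:pending_runner_queue:*", 100).Iterator()
    for iter.Next(ctx) {
        key := iter.Val()
        lastSeen, err := rdb.Get(ctx, key).Int64()
        if err != nil {
            continue
        }
        if time.Since(time.Unix(lastSeen, 0)) > staleAfter {
            jobID := strings.TrimPrefix(key, "build:pending_runner_queue:")
            fmt.Println("declining stale job", jobID) // would push the job back to ci_pending_builds
            rdb.Del(ctx, key)
        }
    }
    return iter.Err()
}

func main() {
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    ctx := context.Background()
    _ = touchPendingJob(ctx, rdb, 1234)
    _ = declineStaleJobs(ctx, rdb)
}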

Mermaid diagram

Adaptation of Long polling workflow:

sequenceDiagram
accTitle: Long polling workflow
accDescr: The flow of a single runner getting a job with long polling enabled

    autonumber
    participant C as Runner
    participant W as Workhorse
    participant Redis as Redis
    participant R as Rails
    participant S as Sidekiq
    C->>+W: POST /api/v4/jobs/request
    W->>+Redis: New job for runner A?
    Redis->>+W: Unknown
    W->>+R: POST /api/v4/jobs/request
    R->>+Redis: Runner A: last_update = X
    R->>W: 204 No job, X-GitLab-Last-Update = X
    W->>C: 204 No job, X-GitLab-Last-Update = X
    C->>W: POST /api/v4/jobs/request, X-GitLab-Last-Update: X
    W->>Redis: Notify when last_update change
    Note over W: Request held in long poll
    Note over S: CI job created (ci_pending_builds)
    Note over S, Redis: Update all registered runners
    S->>Redis: Runner A: last_update = Z
    Redis->>W: Runner: last_update changed
    Note over W: Request released from long poll
    W->>R: POST /api/v4/jobs/request
    Note over R: Job removed from ci_pending_builds
    R->>W: 201 Job was scheduled
    W->>C: 201 Job was scheduled
    loop Every 5 minutes
    C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=pending
    Note over R: Redis build:pending_runner_queue:{job_id} value updated
    R->>+C: 200 OK
    end
    C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=accepted
    Note over R: Job transitioned to running and<br>Redis build:pending_runner_queue:{job_id} value deleted
    R->>+C: 200 OK