Capstone issue: Introduce new API for Runner to transition jobs to 'running' explicitly
Problems
Problem 1
A job, once picked up by GitLab-Runner, is moved to the `running` state immediately, even if GitLab-Runner isn't yet executing the job but is still doing prep work.
- Depending on the executor and configuration, finding capacity is done ad hoc, meaning that even though we're in the `running` state, we're actually looking for/provisioning an environment to execute the job in.
- Even for executors that are configured to have an environment ready before requesting a job, there can still be some prep time required.
- This prep time is misleading and, in some setups, can be counted towards compute minutes.
Problem 2
Jobs are assigned to runners that go offline before starting work on the job, and so stay pending until they time out. Reports of this have been increasing over time, including a severity 2 bug and a recent incident.
Smart routing
There have been previous discussions about a Runner daemon/router, but these relied on significant changes to GitLab.
A side effect of fixing the problems above is that we can seamlessly introduce a smart Runner router/daemon for distributing jobs more efficiently. The router would use the exact same API as GitLab-Runner does.
When a job is picked up by the router, the state transitions to `waiting_for_runner_ack`. At this point, finding a Runner to execute the job is delegated to the router. When a Runner is found and executes the job, it instructs GitLab to transition the job to `running`.
We go from this:
┌────► GitLab-Runner 1
│
GitLab ├────► GitLab-Runner 2
│
└────► GitLab-Runner 3
To this:
┌────► GitLab-Runner 1
│
GitLab ───► Smart router ├────► GitLab-Runner 2
│
└────► GitLab-Runner 3
This doesn't solve the whole spectrum of problems we have with the existing job queue mechanism, but helps for certain setups and customers.
GitLab distributing to Runners at scale is complicated.
GitLab distributing to routers instead frees up some of that responsibility, and this is where a router, now that it has its own queue of jobs, can probably be smarter.
Some scenarios it can help with:
- Rather than multiple runner managers asking the GitLab instance for jobs, only the daemon needs to; all the Runners ask the daemon for jobs instead. This can help with a large fleet of self-hosted Runners hitting GitLab.com and can tighten network permissions.
- Customers can implement what they deem "fair scheduling" across their own fleet of Runners.
Proposal
Introduce a new job state `waiting_for_runner_ack` and a two-phase commit workflow for job assignment that addresses both problems.
Two-phase commit workflow
When a runner with two-phase commit support requests a job:
- Phase 1 - Job Assignment: The job is assigned to the runner and transitioned to the `waiting_for_runner_ack` state
  - The job is removed from `ci_pending_builds` to prevent assignment to other runners
  - The job remains in this state while the runner performs preparation tasks (provisioning, environment setup, etc.)
  - The runner can send keep-alive signals via `PUT /api/v4/jobs/:id` with `state=pending` to prevent timeout
- Phase 2 - Job Acceptance: The runner signals readiness to execute the job
  - Runner calls `PUT /api/v4/jobs/:id` with `state=running` to transition the job to the `running` state
  - At this point, the job timer starts and actual execution begins
  - The response must include updated job metadata (`started_at` timestamp, refreshed job token, etc.)
  - If the runner cannot execute the job, it can decline by calling the same endpoint with an appropriate failure state
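To make the two phases concrete, here is a minimal runner-side sketch in Go. It assumes the endpoint shape described above (`PUT /api/v4/jobs/:id` with a `state` field and the proposed `include_metadata` parameter); the helper names, the JSON payload layout, and the use of `state=failed` as the decline path are illustrative assumptions, not the final Runner implementation.

```go
package runner

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// updateJob PUTs a state change to /api/v4/jobs/:id, optionally asking for
// the refreshed metadata in the response (include_metadata=true).
func updateJob(baseURL string, jobID int64, jobToken, state string, includeMetadata bool) (*http.Response, error) {
	url := fmt.Sprintf("%s/api/v4/jobs/%d", baseURL, jobID)
	if includeMetadata {
		url += "?include_metadata=true"
	}
	payload, err := json.Marshal(map[string]string{"token": jobToken, "state": state})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}

// runJob sketches the two phases after the job was handed out by
// POST /api/v4/jobs/request (which put it into waiting_for_runner_ack).
func runJob(baseURL string, jobID int64, jobToken string, prepare func() error) error {
	// Phase 1: preparation. state=pending keep-alives prevent the
	// waiting_for_runner_ack timeout while the environment is provisioned.
	if _, err := updateJob(baseURL, jobID, jobToken, "pending", false); err != nil {
		return err
	}
	if err := prepare(); err != nil {
		// Decline path: report a failure state so GitLab can retry or fail the job.
		_, _ = updateJob(baseURL, jobID, jobToken, "failed", false)
		return err
	}

	// Phase 2: acceptance. This transitions the job to running, starts the job
	// timer, and (with include_metadata=true) returns refreshed metadata.
	resp, err := updateJob(baseURL, jobID, jobToken, "running", true)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// ... decode started_at / refreshed token from resp.Body, then execute the job ...
	return nil
}
```

A single `updateJob` helper reflects the fact that the proposal reuses one endpoint for keep-alive, decline, and acceptance, distinguishing the cases only by `state` and `include_metadata`.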
State machine changes
- Add a new `waiting_for_runner_ack` state to the job state machine
- State transitions:
  - `pending` → `waiting_for_runner_ack`: When the runner requests a job (job negotiation enabled)
  - `pending` → `running`: When the runner requests a job (legacy workflow, unchanged)
  - `waiting_for_runner_ack` → `running`: When the runner confirms job acceptance
  - `waiting_for_runner_ack` → `failed`: When the runner declines the job or a timeout occurs
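Purely as an illustration, the transitions above collapse into a small lookup table; the authoritative implementation would remain the state machine on `Ci::Build`, this Go sketch only restates the rules.

```go
package runner

// allowedTransitions restates the proposed state machine additions.
// "pending" fans out to either the negotiation path or the legacy path;
// "waiting_for_runner_ack" resolves to running (ack) or failed (decline/timeout).
var allowedTransitions = map[string][]string{
	"pending":                {"waiting_for_runner_ack", "running"},
	"waiting_for_runner_ack": {"running", "failed"},
}

// canTransition reports whether a from→to transition is part of the proposal.
func canTransition(from, to string) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```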
Runner feature detection
Runners declare support for job negotiation via the `supports_job_negotiation` feature flag in their capabilities:
{
"info": {
"features": {
"supports_job_negotiation": true
}
}
}
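On the wire this would presumably be sent as part of the `POST /api/v4/jobs/request` payload. A hedged Go sketch of the relevant structs follows; the type names and nesting mirror the JSON above but are assumptions, not the actual GitLab Runner types.

```go
package runner

// FeaturesInfo and VersionInfo mirror the "info.features" JSON shown above.
type FeaturesInfo struct {
	SupportsJobNegotiation bool `json:"supports_job_negotiation"`
}

type VersionInfo struct {
	Features FeaturesInfo `json:"features"`
}

// JobRequest is a minimal request body for POST /api/v4/jobs/request,
// authenticated with the runner token.
type JobRequest struct {
	Token string      `json:"token"`
	Info  VersionInfo `json:"info"`
}
```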
Timeout handling
- A background worker (`Ci::RetryStuckWaitingJobWorker`) monitors jobs in the `waiting_for_runner_ack` state
- Jobs that don't receive acceptance within the timeout period are automatically retried or failed
- This prevents jobs from being stuck indefinitely if a runner goes offline during preparation
API response structure for job acceptance
When transitioning from `waiting_for_runner_ack` to `running`, the runner needs updated job metadata that is calculated at the moment of transition:
- `started_at` timestamp (for the `CI_JOB_STARTED_AT` environment variable)
- Refreshed job token with correct expiration time (when using JWT)
- Potentially other runtime-calculated values
Implementation approach: Use the `include_metadata` request parameter to opt in to receiving updated metadata in the response body from `PUT /api/v4/jobs/:id`. While `started_at` and the token could be returned in response headers, a structured response is more extensible for future additions.
Example request:
PUT /api/v4/jobs/:id?state=running&include_metadata=true
Example response:
{
"id": 1234,
"token": "<refreshed-job-token>",
"allow_git_fetch": true,
"job_info": {
"id": 1234,
"name": "test-job",
"stage": "test",
"project_id": 5678,
"project_name": "my-project"
},
"git_info": {
"repo_url": "https://gitlab.example.com/my-group/my-project.git",
"ref": "main",
"sha": "a1b2c3d4e5f6",
"before_sha": "0000000000000000000000000000000000000000",
"ref_type": "branch"
},
"runner_info": {
"timeout": 3600,
"runner_session_url": "https://gitlab.example.com/session"
},
"variables": [
{
"key": "CI_JOB_ID",
"value": "1234",
"public": true,
"masked": false
},
{
"key": "CI_JOB_STARTED_AT",
"value": "2025-12-03T19:07:34Z",
"public": true,
"masked": false
},
{
"key": "CI_COMMIT_SHA",
"value": "a1b2c3d4e5f6",
"public": true,
"masked": false
}
],
"steps": [
{
"name": "script",
"script": ["echo 'Running tests'", "npm test"],
"timeout": 3600,
"when": "on_success",
"allow_failure": false
}
],
"image": {
"name": "node:18",
"entrypoint": null
},
"services": [],
"artifacts": [],
"cache": [],
"credentials": [],
"dependencies": [],
"features": {
"trace_sections": true
}
}
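On the runner side, only a few fields of this response need to be re-read at acceptance time. A minimal Go sketch of decoding them, with illustrative struct names rather than the Runner's actual types:

```go
package runner

import "encoding/json"

// jobVariable and acceptanceResponse capture the fields the runner re-reads
// after the waiting_for_runner_ack → running transition: the refreshed job
// token and runtime-calculated variables such as CI_JOB_STARTED_AT.
type jobVariable struct {
	Key    string `json:"key"`
	Value  string `json:"value"`
	Public bool   `json:"public"`
	Masked bool   `json:"masked"`
}

type acceptanceResponse struct {
	ID        int64         `json:"id"`
	Token     string        `json:"token"`
	Variables []jobVariable `json:"variables"`
}

// startedAt extracts CI_JOB_STARTED_AT from the acceptance response, if present.
func startedAt(body []byte) (string, error) {
	var resp acceptanceResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	for _, v := range resp.Variables {
		if v.Key == "CI_JOB_STARTED_AT" {
			return v.Value, nil
		}
	}
	return "", nil
}
```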
Benefits
- Accurate job timing: Job duration only includes actual execution time, not preparation time
- Compute minute accuracy: Preparation time is not counted toward compute minutes
- Improved reliability: Jobs can be reassigned if runners go offline during preparation
- Backward compatibility: Legacy runners continue to work unchanged
- Foundation for smart routing: Enables future router/daemon implementations for more efficient job distribution
Implementation approach
- Use database state (`waiting_for_runner_ack`) instead of Redis for tracking waiting jobs
- Leverage the existing state machine infrastructure for state transitions
- Enhance the existing API endpoint (`PUT /api/v4/jobs/:id`) with metadata response support
- Recalculate runtime values (`started_at`, token) when transitioning to `running`
- Minimal database schema changes (only adding the new state enum value)
Known edge cases and considerations
- Job token lifecycle: The token is assigned during the state machine transition to `running` and can only be used while the job is running
- Multiple trace artifacts: If a job can be put back on the queue but traces are accepted immediately, we need to handle multiple trace artifacts, or drop the job and retry it
- Runner predefined variables: We need a mechanism to send these variables when the job transitions to `running`
- Job timeouts: Currently assigned from the runner; may need to be assigned later, when the job transitions to `running`
- JWT token expiry: The timeout defines the expiry time for the JWT token sent with the payload; we may need new variables at the transition point
- Stuck jobs worker: Will need updates to handle the new state
- Job metrics timestamps: `started_at` should reflect the actual execution start time, not the assignment time
- Runner-job relationship: The `p_ci_runner_machine_builds` table tracks the relationship between jobs and runners
Previous proposal (Redis-based approach)
A solution to this would be to introduce a new Runner feature where, when a job is picked up by a Runner, the job would remain in the `pending` state.
Only when GitLab-Runner has finished its preparation tasks and is ready to actually execute the job would it notify GitLab to transition the job to the `running` state.
With the new feature enabled:
- when `build.runner_id == nil` the job is `pending` (as it is today)
- when `build.runner_id != nil` the job remains `pending`, but no other runner will be assigned (and trace data can be submitted)
- the Runner would then call a new API endpoint to transition the job state to `running`
See #464048 (comment 1942871742) for changes to the database that are anticipated. (UPDATE: this should be doable without database changes, by temporarily removing the entry from `ci_pending_builds` while GitLab waits for the runner to accept the job).
NOTE: Any changes to the flow should be reflected in the documentation, including (but not limited to):
Implementation plan
Current workflow (legacy runner or feature not enabled)
- Pending job is added to `ci_pending_builds` (`Ci::UpdateBuildQueueService#push`).
- Runners poll for jobs (`POST /api/v4/jobs/request` → `Ci::RegisterJobService#execute`).
- Once the pending job is assigned to a specific runner, `Ci::Build.run!` is called, which causes the `ci_builds.status` field to be updated from `:pending` to `:running` and the state machine to call `Ci::UpdateBuildQueueService#pop` to remove the job from `ci_pending_builds`.
New workflow (runner with `supports_job_negotiation` feature enabled)
- Pending job is added to `ci_pending_builds` (`Ci::UpdateBuildQueueService#push`).
- Runners poll for jobs (`POST /api/v4/jobs/request` → `Ci::RegisterJobService#execute`).
- Once the pending job is assigned to a runner with job negotiation support:
  - The job is transitioned to the `waiting_for_runner_ack` state via `Ci::Build.acknowledge_runner!`
  - The runner manager is associated with the build
  - The state machine automatically calls `Ci::UpdateBuildQueueService#pop` to remove the job from `ci_pending_builds`
  - A background job (`Ci::RetryStuckWaitingJobWorker`) is scheduled to handle timeout scenarios
- The runner can send keep-alive signals using the existing `PUT /api/v4/jobs/:id` endpoint with `state=pending` to prevent timeout during preparation.
- When the runner is ready to execute the job, it calls `PUT /api/v4/jobs/:id?state=running&include_metadata=true`:
  - The job transitions from `waiting_for_runner_ack` to `running` via `Ci::Build.run!`
  - Runtime values are calculated: `started_at` timestamp, refreshed job token (if JWT), etc.
  - A structured response is returned containing the updated metadata
  - The runner updates its environment variables with the new values (e.g., `CI_JOB_STARTED_AT`)
  - Job execution begins
- If the runner cannot execute the job or times out:
  - The job can be failed or retried via `Ci::RetryStuckWaitingJobWorker`
  - The job can be reassigned to another runner if available
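For the keep-alive step in this flow, the runner could run a small loop like the following while preparation is in progress. This is a sketch only: the interval and the `ping` callback (which would wrap `PUT /api/v4/jobs/:id` with `state=pending`) are assumptions that would ultimately come from Runner configuration.

```go
package runner

import (
	"context"
	"time"
)

// keepAlive sends a keep-alive at a fixed interval while the runner is still
// preparing the environment, and stops as soon as the context is cancelled
// (i.e. once the job has been accepted, declined, or abandoned).
func keepAlive(ctx context.Context, interval time.Duration, ping func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// A single failed keep-alive is not fatal on its own; GitLab only
			// times the job out once the waiting period elapses without any signal.
			_ = ping()
		}
	}
}
```

The surrounding code would start this in a goroutine when preparation begins and cancel the context just before sending the `state=running` acceptance call.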
API changes
The existing `PUT /api/v4/jobs/:id` endpoint is enhanced with metadata response support:
- `state=pending`: Keep-alive signal during preparation (returns `200 OK`)
- `state=running`: Transition from `waiting_for_runner_ack` to `running`
  - Without `include_metadata=true`: Legacy behavior (returns `200 OK`)
  - With `include_metadata=true`: Returns full job response with updated metadata (see example above)
- Other states: Regular job completion handling (success, failed, etc.)
Database changes
- Add `waiting_for_runner_ack` to the `status` enum in the `ci_builds` table
- No additional tables or columns required
- State transitions are handled by the existing state machine
Monitoring
New Prometheus metrics track the two-phase commit workflow:
- `gitlab_ci_queue_operations_total{operation="runner_assigned_waiting_for_ack"}`: Jobs assigned to the `waiting_for_runner_ack` state
- `gitlab_ci_queue_operations_total{operation="runner_assigned_run"}`: Jobs assigned directly to the `running` state (legacy)
- `gitlab_ci_queue_operations_total{operation="runner_queue_timeout"}`: Jobs that timed out while waiting for acknowledgment
Previous implementation plan (Redis-based approach)
New workflow (new runner with feature flag enabled in `info.features`)
- Pending job is added to `ci_pending_builds` (`Ci::UpdateBuildQueueService#push`).
- Runners poll for jobs (`POST /jobs/request` → `Ci::RegisterJobService#execute`).
- Once the pending job is assigned to a specific runner, `Ci::RegisterJobService` should call `Ci::UpdateBuildQueueService#pop` to pop it from `ci_pending_builds`, so other runners are not assigned this job.
- A new REST API endpoint (e.g. `POST /api/v4/jobs/{job_id}/runner_provisioning`) will allow the runner to inform the GitLab instance about 3 scenarios:

  | status | meaning | handled by | result |
  |--------|---------|------------|--------|
  | `pending` | Runner manager is still preparing the runner. | Rails app (could be Workhorse) | This call will occur regularly as a keep-alive to the GitLab instance. If the GitLab instance doesn't hear from the runner for, say, 5 minutes, it can return the job to the queue and assign it to another runner as in a `declined` scenario. Workhorse could handle this entirely if needed by simply setting a Redis key (`build:pending_runner_queue:#{build_id}`), so we keep a list of pending jobs. |
  | `accepted` | The job has been accepted and the runner has started execution. | Rails app | Transition the job to `running` by calling `Ci::Build.run!`. The state machine will call `Ci::UpdateBuildQueueService#pop` to pop the job from `ci_pending_builds`; since it is no longer present there, this will just be a no-op. |
  | `declined` | The job has been declined by the runner manager. | Rails app | Push the job again to `ci_pending_builds` by calling `Ci::UpdateBuildQueueService#push`. We'll need to clear the `runner_id`, `runner_manager_id`, and `runner_session` attributes from the `ci_builds` record. Perhaps better to cancel the job and retry it. We may need a mechanism to avoid defaulting to the same runner when choosing a fallback runner for a declined job, if other runners are available. |

- A cronjob worker (perhaps `StuckCiJobsWorker`?) goes over the Redis keys representing the pending jobs and declines all jobs that have timed out without hearing further from the runner (any jobs matching Redis `build:pending_runner_queue:*` keys that are too old).
Mermaid diagram
Adaptation of Long polling workflow:
sequenceDiagram
accTitle: Long polling workflow
accDescr: The flow of a single runner getting a job with long polling enabled
autonumber
participant C as Runner
participant W as Workhorse
participant Redis as Redis
participant R as Rails
participant S as Sidekiq
C->>+W: POST /api/v4/jobs/request
W->>+Redis: New job for runner A?
Redis->>+W: Unknown
W->>+R: POST /api/v4/jobs/request
R->>+Redis: Runner A: last_update = X
R->>W: 204 No job, X-GitLab-Last-Update = X
W->>C: 204 No job, X-GitLab-Last-Update = X
C->>W: POST /api/v4/jobs/request, X-GitLab-Last-Update: X
W->>Redis: Notify when last_update change
Note over W: Request held in long poll
Note over S: CI job created (ci_pending_builds)
Note over S, Redis: Update all registered runners
S->>Redis: Runner A: last_update = Z
Redis->>W: Runner: last_update changed
Note over W: Request released from long poll
    W->>R: POST /api/v4/jobs/request
Note over R: Job removed from ci_pending_builds
    R->>W: 201 Job was scheduled
W->>C: 201 Job was scheduled
loop Every 5 minutes
C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=pending
Note over R: Redis build:pending_runner_queue:{job_id} value updated
R->>+C: 200 OK
end
C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=accepted
Note over R: Job transitioned to running and<br>Redis build:pending_runner_queue:{job_id} value deleted
R->>+C: 200 OK