Draft: Introduce waiting_for_runner_ack job status
What does this MR do and why?
This MR introduces a formal waiting_for_runner_ack state to the CI job state machine, replacing the previous Redis-based tracking approach for the two-phase commit feature.
References
Context
Problem
The two-phase commit feature (behind the allow_runner_job_acknowledgement feature flag) previously used a semi-formal state, tracked in Redis, that sat between pending and running. This approach had several issues:
- Race conditions: Multiple runners could be assigned the same job due to timing issues with Redis updates
- Missing job metadata: State transition hooks (like update_timeout_state) didn't execute because jobs stayed in pending state
- Increased complexity: Redis key management added complexity and potential for inconsistencies
- Maintainability issues: Future developers might add transition hooks without considering the "waiting" phase
Solution
This MR adds waiting_for_runner_ack as a formal state in the job state machine:
Previous flow (with Redis):
pending (with Redis flag) → running
New flow (with formal state):
pending → waiting_for_runner_ack → running
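The new flow can be illustrated with a small, dependency-free Ruby sketch. This is not the code in app/models/commit_status.rb: the JobSketch class, its TRANSITIONS table, and hook_log are hypothetical names, and it assumes the run event still accepts pending for legacy runners.

```ruby
# Hypothetical sketch of the new flow; not GitLab's actual state machine.
class JobSketch
  # event => { from_state => to_state }
  TRANSITIONS = {
    acknowledge_runner: { "pending" => "waiting_for_runner_ack" },
    # Assumption: legacy runners still go straight from pending to running.
    run: { "pending" => "running", "waiting_for_runner_ack" => "running" }
  }.freeze

  attr_reader :status, :hook_log

  def initialize
    @status = "pending"
    @hook_log = []
  end

  # Fires an event, or raises if the event is not allowed from the current state.
  def fire!(event)
    target = TRANSITIONS.fetch(event)[@status]
    raise ArgumentError, "cannot #{event} from #{@status}" if target.nil?

    # With a formal state, every transition passes through one place
    # where transition hooks can run.
    @hook_log << "#{@status} -> #{target}"
    @status = target
  end
end

job = JobSketch.new
job.fire!(:acknowledge_runner) # pending -> waiting_for_runner_ack
job.fire!(:run)                # waiting_for_runner_ack -> running
```

Because each step is an explicit transition, an invalid event (for example acknowledge_runner on a running job) fails loudly instead of silently leaving stale state behind.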
Benefits
- ✅ Eliminates race conditions - Database transactions and optimistic locking prevent concurrent assignments
- ✅ Ensures complete metadata - All before_transition and after_transition hooks execute properly
- ✅ Reduces complexity - No Redis key management needed
- ✅ Improves maintainability - Developers naturally add logic to the right transition hooks
- ✅ Better observability - State is visible in database queries, API, GraphQL, and monitoring
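To illustrate why optimistic locking prevents double assignment, here is a minimal in-memory sketch. It is an assumption-laden stand-in, not the MR's implementation: JobRow, assign!, and the Mutex (which plays the role of the database's atomic UPDATE) are all hypothetical.

```ruby
# Hypothetical in-memory stand-in for a job row with a lock_version column.
class JobRow
  attr_reader :status, :lock_version, :runner_id

  def initialize
    @status = "pending"
    @lock_version = 0
    @runner_id = nil
    @mutex = Mutex.new # stands in for the atomicity of a single UPDATE statement
  end

  # Mimics a conditional update of the form:
  #   UPDATE ... SET status = 'waiting_for_runner_ack', lock_version = lock_version + 1
  #   WHERE lock_version = :expected_version AND status = 'pending'
  # Returns true only if the row was still at the version the caller read.
  def assign!(runner_id, expected_version)
    @mutex.synchronize do
      return false unless @lock_version == expected_version && @status == "pending"

      @status = "waiting_for_runner_ack"
      @runner_id = runner_id
      @lock_version += 1
      true
    end
  end
end

row = JobRow.new
seen_version = row.lock_version # both runners read the same version

# Two runners race for the same job; only one conditional update can win.
results = %w[runner-a runner-b].map do |runner|
  Thread.new { row.assign!(runner, seen_version) }
end.map(&:value)
```

Whichever runner loses sees a stale version and gets false back, so it can move on to another job rather than running a duplicate.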
Implementation Details
Core Changes
- State Machine (app/models/commit_status.rb)
  - Added waiting_for_runner_ack to AVAILABLE_STATUSES
  - Added acknowledge_runner event: pending → waiting_for_runner_ack
  - Updated run event to accept transitions from waiting_for_runner_ack
- Service Layer
  - RegisterJobService: Calls build.acknowledge_runner! for two-phase commit
  - UpdateBuildStateService: Handles transitions from waiting_for_runner_ack to running/failed
  - RetryWaitingJobService: Uses database timestamps instead of Redis
- Status Classes
  - Gitlab::Ci::Status::WaitingForRunnerAck - Core status class
  - Gitlab::Ci::Status::Build::WaitingForRunnerAck - Extended status class
  - Icon: status_pending, Label: "waiting for runner acknowledgment"
- Metrics
  - Renamed runner_assigned_waiting → runner_assigned_waiting_for_ack for clarity
  - Tracks jobs entering waiting_for_runner_ack state
  - Tracks timeouts via runner_queue_timeout metric
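As a rough illustration of how a timestamp-based check could replace Redis key expiry for stuck jobs, consider the sketch below. The WaitingJob struct, the waiting_since field, and the 300-second timeout are hypothetical; the MR does not state which column or timeout RetryWaitingJobService uses.

```ruby
# Hypothetical record shape; the real service reads from the database.
WaitingJob = Struct.new(:id, :status, :waiting_since)

ACK_TIMEOUT_SECONDS = 300 # hypothetical value, not taken from the MR

# Selects jobs that have waited for runner acknowledgement longer than the timeout.
def stuck_waiting_jobs(jobs, now: Time.now, timeout: ACK_TIMEOUT_SECONDS)
  jobs.select do |job|
    job.status == "waiting_for_runner_ack" && (now - job.waiting_since) > timeout
  end
end

now = Time.utc(2025, 1, 1, 12, 0, 0)
jobs = [
  WaitingJob.new(1, "waiting_for_runner_ack", now - 600), # waited too long
  WaitingJob.new(2, "waiting_for_runner_ack", now - 30),  # still within the window
  WaitingJob.new(3, "running", now - 600)                 # already acknowledged
]

stuck_ids = stuck_waiting_jobs(jobs, now: now).map(&:id)
```

Passing now explicitly keeps the check deterministic in specs, which is harder to achieve with Redis TTLs.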
Redis Code Removal
All Redis-based tracking code has been removed:
- ✅ Deleted lib/gitlab/ci/build/runner_ack_queue.rb
- ✅ Removed Redis delegates from Ci::Build
- ✅ Removed runner_ack_wait_status method
- ✅ Updated all specs to use formal state
API & GraphQL
- ✅ API automatically supports the new status (uses AVAILABLE_STATUSES)
- ✅ GraphQL enum auto-generates from AVAILABLE_STATUSES
- ✅ Can filter jobs by waiting_for_runner_ack status
- ✅ Status appears in all job responses
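For reviewers who want to try the filter, the sketch below builds a Jobs API URL using the existing scope[] query parameter. It assumes, per the list above, that waiting_for_runner_ack is accepted as a scope value once this MR is merged; the jobs_url helper, host name, and project ID are placeholders.

```ruby
require "uri"

# Builds a project jobs URL filtered by a single status.
# jobs_url is a hypothetical helper, not part of GitLab.
def jobs_url(base, project_id, status)
  uri = URI.join(base, "/api/v4/projects/#{project_id}/jobs")
  uri.query = URI.encode_www_form("scope[]" => status)
  uri.to_s
end

url = jobs_url("https://gitlab.example.com", 42, "waiting_for_runner_ack")
```

The resulting URL can be passed to curl or Net::HTTP together with a valid access token.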
Documentation
- ✅ Updated doc/ci/jobs/_index.md - Added to "Available job statuses"
- ✅ Updated doc/api/jobs.md - Added to "Job status values"
- ✅ Updated doc/development/cicd/two_phase_job_commit.md - Removed Redis references
- ✅ Updated doc/development/cicd/_index.md - Corrected architecture overview
- ✅ Updated doc/architecture/decisions/0XX_add_waiting_for_runner_ack_state.md - Marked as completed
Test Coverage
Comprehensive test coverage with 200+ examples, 0 failures:
- ✅ State machine transitions: spec/models/ci/build_waiting_for_runner_ack_state_spec.rb (23 examples)
- ✅ Service layer: spec/services/ci/update_build_state_service_waiting_for_runner_ack_spec.rb (22 examples)
- ✅ Worker: spec/workers/ci/retry_stuck_waiting_job_worker_spec.rb (20 examples)
- ✅ Queue services: spec/services/ci/update_build_queue_service_waiting_for_runner_ack_spec.rb (16 examples)
- ✅ Status classes: spec/lib/gitlab/ci/status/build/waiting_for_runner_ack_spec.rb (8 examples)
- ✅ Integration: spec/requests/api/ci/runner_job_confirmation_integration_spec.rb (11 examples)
- ✅ Two-phase commit: spec/services/ci/register_job_service_two_phase_commit_spec.rb (43 examples)
- ✅ Race conditions: Tested across multiple spec files (30+ examples)
Migration
A no-op migration documents the change:
- Migration file: db/migrate/20251127000000_add_waiting_for_runner_ack_status.rb
- No schema changes needed (status is a string enum)
- Rollback plan documented in migration
References
- Related to #578881 (closed) (Race condition)
- Related to #581720 (closed) (Missing transition hooks)
- Related to #568905 (Feature flag rollout)
- Related to #341293 (Two-phase commit feature)
How to set up and validate locally
- Enable the feature flag:
    Feature.enable(:allow_runner_job_acknowledgement)
- Create a pipeline with a job:
    test_job:
      script: echo "Hello"
- Use a runner with two-phase commit support:
  - Runner must send two_phase_job_commit: true in capabilities
  - Job will transition to waiting_for_runner_ack state
  - Runner sends keep-alive signals with PUT /jobs/:id (state=pending)
  - Runner accepts job with PUT /jobs/:id (state=running)
- Verify the status:
  - Check job status in UI: Should show "waiting for runner acknowledgment"
  - Query API: GET /api/v4/projects/:id/jobs - Status will be waiting_for_runner_ack
  - Check database: Ci::Build.where(status: 'waiting_for_runner_ack')
- Monitor metrics:
    gitlab_ci_queue_operations_total{operation="runner_assigned_waiting_for_ack"}
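The runner interaction in the steps above can be summarised with a small sketch of how the state sent in PUT /jobs/:id might map to an outcome while a job is waiting. The handle_runner_update function and its return symbols are hypothetical; the real logic lives in UpdateBuildStateService and may differ.

```ruby
# Hypothetical mapping from the state a runner sends to the resulting job status.
# Returns [new_status, outcome]. Only the waiting_for_runner_ack phase is modelled.
def handle_runner_update(current_status, requested_state)
  # Assumption: other statuses are handled elsewhere and left untouched here.
  return [current_status, :out_of_scope] unless current_status == "waiting_for_runner_ack"

  case requested_state
  when "pending" then ["waiting_for_runner_ack", :keep_alive] # runner is still preparing
  when "running" then ["running", :accepted]                  # runner accepted the job
  when "failed"  then ["failed", :dropped]                    # runner could not start it
  else raise ArgumentError, "unexpected state: #{requested_state}"
  end
end

status, outcome = handle_runner_update("waiting_for_runner_ack", "pending")
status, outcome = handle_runner_update(status, "running")
```

A keep-alive leaves the status unchanged, so repeated state=pending updates are safe until the runner finally sends state=running or state=failed.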
MR acceptance checklist
This MR has been evaluated against the MR acceptance checklist:
- ✅ Quality: Comprehensive test coverage (200+ examples), all passing
- ✅ Performance: No performance regression - uses existing database fields and indexes
- ✅ Reliability: Eliminates race conditions through database transactions
- ✅ Security: No security implications - uses existing authentication
- ✅ Maintainability: Simplifies codebase by removing Redis dependency
- ✅ Documentation: User and developer docs updated
- ✅ Backward Compatibility: Fully backward compatible with legacy runners
- ✅ Observability: Metrics and logging in place