Draft: Introduce waiting_for_runner_ack job status

What does this MR do and why?

This MR introduces a formal waiting_for_runner_ack state to the CI job state machine, replacing the previous Redis-based tracking approach for the two-phase commit feature.

References

#464048

Context

Problem

The two-phase commit feature (behind the allow_runner_job_acknowledgement feature flag) previously tracked a semi-formal state in Redis, sitting between pending and running. This approach had several issues:

  1. Race conditions: Multiple runners could be assigned the same job due to timing issues with Redis updates
  2. Missing job metadata: State transition hooks (like update_timeout_state) didn't execute because jobs stayed in pending state
  3. Increased complexity: Redis key management added complexity and potential for inconsistencies
  4. Maintainability issues: Future developers might add transition hooks without considering the "waiting" phase

Solution

This MR adds waiting_for_runner_ack as a formal state in the job state machine:

Previous flow (with Redis):

pending (with Redis flag) → running

New flow (with formal state):

pending → waiting_for_runner_ack → running
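
The new flow can be illustrated with a minimal, framework-free sketch. This is not the code in app/models/commit_status.rb. The JobSketch class, the drop event name, and the pending → running path for legacy runners are assumptions made for illustration only. It shows why a formal state matters: every transition passes through one place where hooks can run.

```ruby
# Hypothetical sketch; not GitLab's CommitStatus state machine.
class JobSketch
  InvalidTransition = Class.new(StandardError)

  # event => { from-state => to-state }
  TRANSITIONS = {
    acknowledge_runner: { 'pending' => 'waiting_for_runner_ack' },
    # Assumption: legacy runners still go straight from pending to running.
    run: { 'pending' => 'running', 'waiting_for_runner_ack' => 'running' },
    # Assumption: the failure event is called "drop" here.
    drop: { 'waiting_for_runner_ack' => 'failed', 'running' => 'failed' }
  }.freeze

  attr_reader :status, :hook_log

  def initialize
    @status = 'pending'
    @hook_log = []
  end

  def fire!(event)
    target = TRANSITIONS.fetch(event)[@status]
    raise InvalidTransition, "#{event} not allowed from #{@status}" if target.nil?

    # Stand-in for before_transition/after_transition hooks
    # (for example update_timeout_state): record every transition.
    @hook_log << [@status, target]
    @status = target
  end
end

job = JobSketch.new
job.fire!(:acknowledge_runner)
job.fire!(:run)
puts job.status # prints "running"
```

Because the waiting phase is a real state, the hook log records both transitions. With the Redis flag the job would have stayed in pending and the first entry would never exist.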

Benefits

  1. Eliminates race conditions - Database transactions and optimistic locking prevent concurrent assignments
  2. Ensures complete metadata - All before_transition and after_transition hooks execute properly
  3. Reduces complexity - No Redis key management needed
  4. Improves maintainability - Developers naturally add logic to the right transition hooks
  5. Better observability - State is visible in database queries, API, GraphQL, and monitoring
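
To make the first benefit concrete, here is a rough in-memory sketch of optimistic locking. It assumes a lock_version-style counter; RowSketch and StaleObjectError are hypothetical names, and the Mutex only stands in for the atomicity a database UPDATE provides. It is not the MR's actual locking code.

```ruby
# Hypothetical in-memory stand-in for a database row with a version counter.
class RowSketch
  StaleObjectError = Class.new(StandardError)

  attr_reader :status, :runner_id, :lock_version

  def initialize
    @status = 'pending'
    @runner_id = nil
    @lock_version = 0
    @mutex = Mutex.new
  end

  # Mimics "UPDATE ... WHERE id = ? AND lock_version = ?":
  # the write succeeds only if nobody has written since the caller read the row.
  def assign!(runner_id, expected_version)
    @mutex.synchronize do
      raise StaleObjectError, 'row changed since it was read' unless @lock_version == expected_version

      @status = 'waiting_for_runner_ack'
      @runner_id = runner_id
      @lock_version += 1
    end
  end
end

row = RowSketch.new
seen = row.lock_version # both runners read the same version

row.assign!(:runner_a, seen)
begin
  row.assign!(:runner_b, seen)
rescue RowSketch::StaleObjectError
  puts 'runner_b lost the race'
end
```

The second assignment fails because the version it read is stale, so only one runner ever owns the job.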

Implementation Details

Core Changes

  1. State Machine (app/models/commit_status.rb)

    • Added waiting_for_runner_ack to AVAILABLE_STATUSES
    • Added acknowledge_runner event: pending → waiting_for_runner_ack
    • Updated run event to accept transitions from waiting_for_runner_ack
  2. Service Layer

    • RegisterJobService: Calls build.acknowledge_runner! for two-phase commit
    • UpdateBuildStateService: Handles transitions from waiting_for_runner_ack to running/failed
    • RetryWaitingJobService: Uses database timestamps instead of Redis
  3. Status Classes

    • Gitlab::Ci::Status::WaitingForRunnerAck - Core status class
    • Gitlab::Ci::Status::Build::WaitingForRunnerAck - Extended status class
    • Icon: status_pending, Label: "waiting for runner acknowledgment"
  4. Metrics

    • Renamed runner_assigned_waiting → runner_assigned_waiting_for_ack for clarity
    • Tracks jobs entering waiting_for_runner_ack state
    • Tracks timeouts via runner_queue_timeout metric
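
As an illustration of timestamp-based timeout detection, the sketch below selects jobs that have sat in waiting_for_runner_ack for too long. The JobRow struct, the use of updated_at, and the 300-second timeout are all assumptions. The MR does not state which column or threshold RetryWaitingJobService uses.

```ruby
# Hypothetical row shape; the real service queries Ci::Build records.
JobRow = Struct.new(:id, :status, :updated_at)

ACK_TIMEOUT_SECONDS = 300 # assumed value, for illustration only

# Returns the jobs still waiting for a runner acknowledgement past the timeout.
def stuck_waiting_jobs(jobs, now:, timeout: ACK_TIMEOUT_SECONDS)
  jobs.select do |job|
    job.status == 'waiting_for_runner_ack' && (now - job.updated_at) > timeout
  end
end

now = Time.at(1_000_000)
jobs = [
  JobRow.new(1, 'waiting_for_runner_ack', now - 600), # stuck
  JobRow.new(2, 'waiting_for_runner_ack', now - 10),  # still within the timeout
  JobRow.new(3, 'running', now - 600)                 # not waiting at all
]
puts stuck_waiting_jobs(jobs, now: now).map(&:id).inspect # prints [1]
```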

Redis Code Removal

All Redis-based tracking code has been removed:

  • Deleted lib/gitlab/ci/build/runner_ack_queue.rb
  • Removed Redis delegates from Ci::Build
  • Removed runner_ack_wait_status method
  • Updated all specs to use formal state

API & GraphQL

  • API automatically supports the new status (uses AVAILABLE_STATUSES)
  • GraphQL enum auto-generates from AVAILABLE_STATUSES
  • Can filter jobs by waiting_for_runner_ack status
  • Status appears in all job responses

Documentation

  • Updated doc/ci/jobs/_index.md - Added to "Available job statuses"
  • Updated doc/api/jobs.md - Added to "Job status values"
  • Updated doc/development/cicd/two_phase_job_commit.md - Removed Redis references
  • Updated doc/development/cicd/_index.md - Corrected architecture overview
  • Updated doc/architecture/decisions/0XX_add_waiting_for_runner_ack_state.md - Marked as completed

Test Coverage

Comprehensive test coverage with 200+ examples, 0 failures:

  • State machine transitions: spec/models/ci/build_waiting_for_runner_ack_state_spec.rb (23 examples)
  • Service layer: spec/services/ci/update_build_state_service_waiting_for_runner_ack_spec.rb (22 examples)
  • Worker: spec/workers/ci/retry_stuck_waiting_job_worker_spec.rb (20 examples)
  • Queue services: spec/services/ci/update_build_queue_service_waiting_for_runner_ack_spec.rb (16 examples)
  • Status classes: spec/lib/gitlab/ci/status/build/waiting_for_runner_ack_spec.rb (8 examples)
  • Integration: spec/requests/api/ci/runner_job_confirmation_integration_spec.rb (11 examples)
  • Two-phase commit: spec/services/ci/register_job_service_two_phase_commit_spec.rb (43 examples)
  • Race conditions: Tested across multiple spec files (30+ examples)

Migration

A no-op migration documents the change:

  • Migration file: db/migrate/20251127000000_add_waiting_for_runner_ack_status.rb
  • No schema changes needed (status is a string enum)
  • Rollback plan documented in migration

How to set up and validate locally

  1. Enable the feature flag:

    Feature.enable(:allow_runner_job_acknowledgement)
  2. Create a pipeline with a job:

    test_job:
      script: echo "Hello"
  3. Use a runner with two-phase commit support:

    • Runner must send two_phase_job_commit: true in capabilities
    • Job will transition to waiting_for_runner_ack state
    • Runner sends keep-alive signals with PUT /jobs/:id (state=pending)
    • Runner accepts job with PUT /jobs/:id (state=running)
  4. Verify the status:

    • Check job status in UI: Should show "waiting for runner acknowledgment"
    • Query API: GET /api/v4/projects/:id/jobs - Status will be waiting_for_runner_ack
    • Check database: Ci::Build.where(status: 'waiting_for_runner_ack')
  5. Monitor metrics:

    gitlab_ci_queue_operations_total{operation="runner_assigned_waiting_for_ack"}
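
The keep-alive and accept calls from step 3 can be sketched as request descriptions. The helper below only builds the data and sends nothing. The job_update_request name and the JSON body shape are assumptions, and authentication is omitted. Only the PUT /jobs/:id path and the two state values come from the steps above.

```ruby
require 'json'

# Hypothetical helper: describes (but does not send) a runner's job update.
def job_update_request(job_id, state)
  # Only the two states used in the two-phase handshake are allowed here.
  raise ArgumentError, "unexpected state: #{state}" unless %w[pending running].include?(state)

  { method: 'PUT', path: "/jobs/#{job_id}", body: JSON.generate({ state: state }) }
end

keep_alive = job_update_request(42, 'pending') # job stays in waiting_for_runner_ack
accept = job_update_request(42, 'running')     # job moves to running

puts keep_alive[:body] # prints {"state":"pending"}
puts accept[:body]     # prints {"state":"running"}
```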

MR acceptance checklist

This MR has been evaluated against the MR acceptance checklist:

  • Quality: Comprehensive test coverage (200+ examples), all passing
  • Performance: No performance regression - uses existing database fields and indexes
  • Reliability: Eliminates race conditions through database transactions
  • Security: No security implications - uses existing authentication
  • Maintainability: Simplifies codebase by removing Redis dependency
  • Documentation: User and developer docs updated
  • Backward Compatibility: Fully backward compatible with legacy runners
  • Observability: Metrics and logging in place
Edited by Pedro Pombeiro