Fix job state transition hooks not taken into account when using two-phase commit

Based on the findings in note 2905078452, enabling the allow_runner_job_acknowledgement feature flag causes the job timeout metadata to be unavailable, or incorrectly set to 0s, while a job is in the pending state during Phase 1 of the two-phase commit workflow.

Impact: Kubernetes executor jobs using FF_USE_POD_ACTIVE_DEADLINE_SECONDS fail immediately with runner_system_failure because activeDeadlineSeconds is set to 0s instead of the configured job timeout.

Required Fixes

1. Ensure Complete Job Metadata in Pending State

Problem: The job timeout (and potentially other metadata) is not fully populated in the job payload when the job is assigned to a runner while still in the pending state.

Fix: Modify the job assignment logic so that all critical job metadata is included in the response to POST /api/v4/jobs/request, even when the job remains in the pending state.

Files to investigate:

  • lib/api/ci/runner.rb - Job request endpoint
  • app/services/ci/register_job_service.rb - Job assignment service
  • Job serializer used for runner API responses
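
As a rough illustration of the fix, the sketch below resolves the timeout at assignment time instead of relying on a hook tied to the pending-to-running transition. It assumes (based on the issue title, not confirmed here) that the timeout is normally filled in by such a hook. The Job struct, populate_timeout, assign_to_runner, project_timeout, and the runner_info.timeout payload location are all hypothetical names used for illustration, not the actual GitLab code.

```ruby
# Hypothetical sketch: these names are illustrative, not GitLab's code.
Job = Struct.new(:id, :status, :timeout, :project_timeout, keyword_init: true)

# Resolve the timeout without depending on a state transition.
# Falls back to an assumed project-level timeout when none is set yet.
def populate_timeout(job)
  job.timeout ||= job.project_timeout
  job
end

# Phase 1 of two-phase commit: the job is handed to a runner but stays
# pending, so metadata is populated here rather than in a transition hook.
def assign_to_runner(job)
  populate_timeout(job)
  { id: job.id, status: job.status, runner_info: { timeout: job.timeout } }
end

job = Job.new(id: 1, status: "pending", timeout: nil, project_timeout: 3600)
payload = assign_to_runner(job)
puts payload.inspect
```

The point of the sketch is only that the payload carries a non-zero timeout while the status is still pending; where the real fix belongs among the files listed above still needs investigation.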

2. Add Validation for Job Payload Completeness

Add validation to ensure the job payload sent to runners includes:

  • Job timeout (timeout)
  • Resource limits
  • All required variables
  • Any other metadata runners need during the preparation phase
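
A completeness check along these lines might look like the following minimal sketch. The payload_errors helper and the key names (:runner_info, :timeout, :variables) are assumptions for illustration; resource limits are left out because their payload shape is not specified in this issue.

```ruby
# Hypothetical completeness check for a runner job payload.
# Key names (:runner_info, :timeout, :variables) are assumptions.
def payload_errors(payload)
  errors = []

  # A timeout of 0 (or a missing one) is exactly the failure described above.
  timeout = payload.dig(:runner_info, :timeout)
  unless timeout.is_a?(Integer) && timeout.positive?
    errors << "timeout must be a positive integer, got #{timeout.inspect}"
  end

  variables = payload[:variables]
  unless variables.is_a?(Array)
    errors << "variables must be an array, got #{variables.inspect}"
  end

  errors
end

incomplete = { id: 1, runner_info: { timeout: 0 }, variables: [] }
puts payload_errors(incomplete)
```

Returning a list of errors rather than raising on the first one makes it easier to log every missing field in one pass.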

3. Add Test Coverage

Required tests:

  • Integration test: Kubernetes executor with FF_USE_POD_ACTIVE_DEADLINE_SECONDS + two-phase commit
  • Unit test: Job payload includes timeout when job is in pending state
  • E2E test: Verify activeDeadlineSeconds is set correctly for jobs using two-phase commit

4. Investigate Other Potential Metadata Issues

Review whether other job attributes might have similar issues:

  • Resource limits (CPU, memory)
  • Service containers configuration
  • Cache/artifact settings
  • Custom variables that might be lazily loaded
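
One way to approach this review is to diff the payload produced for a pending job against the payload produced for the same job once it is running, and inspect every path that differs. The sketch below is a generic nested-hash diff; diff_paths and the sample payloads are hypothetical and do not reflect the real response schema.

```ruby
# Return the dotted key paths whose values differ between two nested hashes.
def diff_paths(left, right, prefix = [])
  (left.keys | right.keys).flat_map do |key|
    path = prefix + [key]
    a = left[key]
    b = right[key]
    if a.is_a?(Hash) && b.is_a?(Hash)
      diff_paths(a, b, path) # recurse into nested sections
    elsif a == b
      []
    else
      [path.join(".")]
    end
  end
end

# Sample payloads are made up for illustration only.
pending_payload = { runner_info: { timeout: 0 }, cache: [], services: [{ name: "postgres" }] }
running_payload = { runner_info: { timeout: 3600 }, cache: [], services: [{ name: "postgres" }] }
puts diff_paths(pending_payload, running_payload).inspect
```

Any path reported by such a diff is a candidate for the same class of bug as the timeout.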

Verification Steps Before Next Rollout

Before re-enabling the feature flag:

  1. Confirm the job timeout is present in the API response for jobs in the pending state
  2. Test with the Kubernetes executor using FF_USE_POD_ACTIVE_DEADLINE_SECONDS
  3. Verify activeDeadlineSeconds is set to the correct timeout value (not 0s)
  4. Monitor for runner_system_failure errors during staged rollout

/copy_metadata #578881