Fix job state transition hooks not taken into account when using two-phase commit
Based on the findings in note 2905078452, the allow_runner_job_acknowledgement feature flag is causing job timeout metadata to be unavailable or incorrectly set to 0s when jobs are in the pending state during Phase 1 of the two-phase commit workflow.
Impact: Kubernetes executor jobs using FF_USE_POD_ACTIVE_DEADLINE_SECONDS fail immediately with runner_system_failure because activeDeadlineSeconds is set to 0s instead of the configured job timeout.
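The failure mode can be illustrated with a minimal Python sketch. This is not the runner's actual code: the helper name `pod_active_deadline_seconds`, the exception class, and the flat payload shape are assumptions made for illustration. It shows why a missing or zero timeout should be rejected rather than passed through as the pod deadline.

```python
from typing import Any, Mapping


class MissingTimeoutError(ValueError):
    """Raised when the job payload carries no usable timeout."""


def pod_active_deadline_seconds(job_payload: Mapping[str, Any]) -> int:
    """Hypothetical helper: derive the pod deadline (seconds) from the job timeout."""
    timeout = job_payload.get("timeout")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        # Passing 0 through would produce the 0s deadline described above,
        # so fail loudly instead.
        raise MissingTimeoutError(f"unusable job timeout: {timeout!r}")
    return timeout
```

With a guard of this kind, a payload that lacks the timeout surfaces as an explicit error at preparation time instead of as a pod with a 0s deadline.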
Related
Required Fixes
1. Ensure Complete Job Metadata in Pending State
Problem: Job timeout (and potentially other metadata) is not fully populated in the job payload when the job is assigned to a runner in pending state.
Fix: Modify the job assignment logic to ensure all critical job metadata is included in the response to POST /api/v4/jobs/request even when the job remains in pending state.
Files to investigate:
- lib/api/ci/runner.rb - Job request endpoint
- app/services/ci/register_job_service.rb - Job assignment service
- Job serializer used for runner API responses
2. Add Validation for Job Payload Completeness
Add validation to ensure the job payload sent to runners includes:
- ✅ Job timeout (timeout)
- ✅ Resource limits
- ✅ All required variables
- ✅ Any other metadata runners need during preparation phase
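A completeness check along these lines could be sketched as follows. The sketch is in Python for illustration only; the field names `resource_limits` and `variables` and the function `payload_problems` are hypothetical, and only `timeout` is named in the list above.

```python
from typing import Any, List, Mapping

# Hypothetical field names; only "timeout" appears in the checklist above.
REQUIRED_FIELDS = ("timeout", "resource_limits", "variables")


def payload_problems(payload: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the payload is complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        if payload.get(field) is None:
            problems.append(f"{field} is missing")
    timeout = payload.get("timeout")
    # A present but non-positive timeout is the case behind the 0s deadline.
    if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
        problems.append(f"timeout must be a positive integer, got {timeout!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to log every missing field for a job in a single message.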
3. Add Test Coverage
Required tests:
- Integration test: Kubernetes executor with FF_USE_POD_ACTIVE_DEADLINE_SECONDS + two-phase commit
- Unit test: Job payload includes timeout when job is in pending state
- E2E test: Verify activeDeadlineSeconds is set correctly for jobs using two-phase commit
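The shape of the unit test can be sketched as below. This is an illustration in Python, not the project's test suite: `build_job_payload` is a hypothetical stand-in for the job serializer, and the dictionary keys are assumptions.

```python
import unittest
from typing import Any, Dict


def build_job_payload(job: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the serializer used for runner API responses."""
    # The timeout is copied regardless of job status, which is the
    # behavior the fix is meant to guarantee.
    return {"id": job["id"], "status": job["status"], "timeout": job["timeout"]}


class PendingJobPayloadTest(unittest.TestCase):
    def test_pending_job_includes_positive_timeout(self) -> None:
        job = {"id": 1, "status": "pending", "timeout": 3600}
        payload = build_job_payload(job)
        self.assertEqual(payload["status"], "pending")
        self.assertEqual(payload["timeout"], 3600)
        self.assertGreater(payload["timeout"], 0)
```

The key assertion is that the timeout is both present and greater than zero while the job status is still pending.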
4. Investigate Other Potential Metadata Issues
Review whether other job attributes might have similar issues:
- Resource limits (CPU, memory)
- Service containers configuration
- Cache/artifact settings
- Custom variables that might be lazily loaded
Verification Steps Before Next Rollout
Before re-enabling the feature flag:
- Confirm job timeout is present in API response for pending state jobs
- Test with Kubernetes executor using FF_USE_POD_ACTIVE_DEADLINE_SECONDS
- Verify activeDeadlineSeconds is set to correct timeout value (not 0s)
- Monitor for runner_system_failure errors during staged rollout
/copy_metadata #578881