Fix job state transition hooks not taken into account when using two-phase commit
Based on the findings in note 2905078452, the allow_runner_job_acknowledgement feature flag causes job timeout metadata to be unavailable, or incorrectly set to 0s, while jobs are in the pending state during Phase 1 of the two-phase commit workflow.
Impact: Kubernetes executor jobs using FF_USE_POD_ACTIVE_DEADLINE_SECONDS fail immediately with runner_system_failure because activeDeadlineSeconds is set to 0s instead of the configured job timeout.
Related
Required Fixes
1. Ensure Complete Job Metadata in Pending State
Problem: Job timeout (and potentially other metadata) is not fully populated in the job payload when the job is assigned to a runner in pending state.
Fix: Modify the job assignment logic to ensure all critical job metadata is included in the response to POST /api/v4/jobs/request even when the job remains in pending state.
Files to investigate:
- lib/api/ci/runner.rb - Job request endpoint
- app/services/ci/register_job_service.rb - Job assignment service
- Job serializer used for runner API responses
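The intent of this fix can be sketched in plain Ruby. This is an illustration only, not GitLab code: `PendingJob`, `build_runner_payload`, and the flat `timeout` key are hypothetical names, and the real serializer may structure the response differently. The point is that the timeout is emitted regardless of the job's status.

```ruby
# Hypothetical stand-in for a CI job record; not the real GitLab model.
PendingJob = Struct.new(:id, :status, :timeout_seconds, keyword_init: true)

# Build the runner-facing payload. The timeout is always included,
# whether the job is still "pending" (Phase 1) or already "running".
def build_runner_payload(job)
  {
    id: job.id,
    status: job.status,
    timeout: job.timeout_seconds
  }
end

job = PendingJob.new(id: 1, status: "pending", timeout_seconds: 3600)
payload = build_runner_payload(job)
puts payload[:timeout]
```

Keeping the payload builder independent of job status means a later change to the state machine is less likely to silently drop fields from the Phase 1 response.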
2. Add Validation for Job Payload Completeness and Other Job Transition Side Effects
Add validation to ensure the job payload sent to runners includes:
- ✅ Job timeout (timeout)
- ✅ Resource limits
- ✅ All required variables
- ✅ Any other metadata runners need during preparation phase
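One possible shape for such a check is sketched below in plain Ruby. `REQUIRED_PAYLOAD_KEYS` and `payload_problems` are hypothetical names, and only the timeout and variables keys are covered here; resource limits and other metadata would need the same treatment once their payload keys are confirmed.

```ruby
# Hypothetical list of keys a runner needs during the preparation phase.
REQUIRED_PAYLOAD_KEYS = %i[timeout variables].freeze

# Returns a list of human-readable problems.
# An empty list means the payload looks complete.
def payload_problems(payload)
  problems = REQUIRED_PAYLOAD_KEYS
             .reject { |key| payload.key?(key) }
             .map { |key| "missing #{key}" }

  # A present but zero, negative, or non-integer timeout is as harmful
  # as a missing one, so it is flagged explicitly.
  timeout = payload[:timeout]
  if payload.key?(:timeout) && !(timeout.is_a?(Integer) && timeout.positive?)
    problems << "timeout must be a positive integer, got #{timeout.inspect}"
  end

  problems
end

p payload_problems({ timeout: 0, variables: [] })
p payload_problems({ timeout: 3600, variables: [] })
```

Returning a list of problems rather than raising on the first one makes it easier to log every gap in a single pass during a staged rollout.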
3. Add Test Coverage
Required tests:
- Integration test: Kubernetes executor with FF_USE_POD_ACTIVE_DEADLINE_SECONDS + two-phase commit
- Unit test: Job payload includes timeout when job is in pending state
- E2E test: Verify activeDeadlineSeconds is set correctly for jobs using two-phase commit
4. Investigate Other Potential Metadata Issues
Review whether other job attributes might have similar issues:
- Resource limits (CPU, memory)
- Service containers configuration
- Cache/artifact settings
- Custom variables that might be lazily loaded
Verification Steps Before Next Rollout
Before re-enabling the feature flag:
- Confirm job timeout is present in API response for pending state jobs
- Test with Kubernetes executor using FF_USE_POD_ACTIVE_DEADLINE_SECONDS
- Verify activeDeadlineSeconds is set to correct timeout value (not 0s)
- Monitor for runner_system_failure errors during staged rollout

/copy_metadata #578881 (closed)