Update PUT /jobs/:id endpoint for 2-phase commit of pending builds

What does this MR do and why?

This MR updates the PUT /jobs/:id REST endpoint to support 2-phase commit of pending builds, which lets the runner tell GitLab whether it actually accepted the job. A follow-up MR will introduce a worker that cleans up jobs that time out without being acknowledged by a runner, so that we can fix #341293 (closed).

Given the risk of the changes involved, the logic in this MR is gated behind a new allow_runner_job_acknowledgement feature flag (FF). The FF is scoped to the top-level namespace.

Details

This merge request implements a "two-phase commit" feature for GitLab's CI/CD job system to improve job timing accuracy and reliability.

Currently, when a runner requests a job, the job immediately switches to "running" status even though the runner might still be setting up the environment or downloading dependencies. This causes preparation time to be incorrectly counted as execution time.

The new system works in two phases: First, when a runner requests a job, the job stays in "pending" status while the runner prepares. Second, once the runner is actually ready to execute, it sends a signal to transition the job to "running" status. This ensures only actual execution time is counted.
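The two phases described above can be sketched as a small state machine. This is only an illustrative model, not the GitLab implementation: the Job type, the RequestJob and AcknowledgeJob functions, and ErrConflict are hypothetical names introduced here, and the real logic lives in the Rails application.

```go
package main

import (
	"errors"
	"fmt"
)

// JobStatus is a simplified stand-in for the CI job status.
type JobStatus string

const (
	StatusPending JobStatus = "pending"
	StatusRunning JobStatus = "running"
)

// Job is a hypothetical, minimal model of a CI job.
type Job struct {
	ID             int
	Status         JobStatus
	AssignedRunner string // empty until a runner picks up the job
}

// ErrConflict models a request that does not match the job's current state.
var ErrConflict = errors.New("job is not available to this runner in its current state")

// RequestJob is phase one: the runner picks up the job.
// A two-phase runner leaves the job in "pending" while it prepares;
// a legacy runner moves it straight to "running", as before.
func RequestJob(j *Job, runner string, twoPhase bool) error {
	if runner == "" || j.Status != StatusPending || j.AssignedRunner != "" {
		return ErrConflict
	}
	j.AssignedRunner = runner
	if !twoPhase {
		j.Status = StatusRunning
	}
	return nil
}

// AcknowledgeJob is phase two: the runner that picked up the job signals
// that it is ready, and only then does the job become "running".
func AcknowledgeJob(j *Job, runner string) error {
	if runner == "" || j.Status != StatusPending || j.AssignedRunner != runner {
		return ErrConflict
	}
	j.Status = StatusRunning
	return nil
}

func main() {
	job := &Job{ID: 1, Status: StatusPending}

	_ = RequestJob(job, "runner-a", true)
	fmt.Println(job.Status) // pending

	_ = AcknowledgeJob(job, "runner-a")
	fmt.Println(job.Status) // running
}
```

In this sketch, an acknowledgement from a runner that does not hold the job is rejected, which is one plausible reading of when a conflict response would be returned.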

The feature is backward compatible: older runners continue working as before, while newer runners opt into the improved workflow by declaring support for "two-phase commit" in their capabilities. Jobs waiting for runner acknowledgment are temporarily tracked in the Redis cache instead of the database queue, which prevents them from being assigned to other runners.
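As a rough sketch of the opt-in and reservation behavior described above: the two_phase_job_commit key comes from the runner patch in this MR, but everything else is assumed for illustration. UsesTwoPhaseCommit and PendingAcks are hypothetical names, and the in-memory map merely stands in for the Redis cache.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// runnerFeatures mirrors only the capability relevant here.
// Older runners omit the key, so the field decodes to false.
type runnerFeatures struct {
	TwoPhaseJobCommit bool `json:"two_phase_job_commit"`
}

// UsesTwoPhaseCommit reports whether the new workflow applies: the feature
// flag must be enabled AND the runner must declare the capability.
func UsesTwoPhaseCommit(featuresJSON []byte, flagEnabled bool) (bool, error) {
	var f runnerFeatures
	if err := json.Unmarshal(featuresJSON, &f); err != nil {
		return false, err
	}
	return flagEnabled && f.TwoPhaseJobCommit, nil
}

// PendingAcks is an in-memory stand-in (assumption) for the cache that
// tracks jobs waiting for a runner acknowledgement.
type PendingAcks struct {
	mu     sync.Mutex
	owners map[int]string // job ID -> runner holding the job
}

func NewPendingAcks() *PendingAcks {
	return &PendingAcks{owners: make(map[int]string)}
}

// Reserve records that a runner holds the job. It returns false if another
// reservation already exists, so the job is not handed out twice.
func (p *PendingAcks) Reserve(jobID int, runner string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, taken := p.owners[jobID]; taken {
		return false
	}
	p.owners[jobID] = runner
	return true
}

func main() {
	legacy, _ := UsesTwoPhaseCommit([]byte(`{"cancel_gracefully":true}`), true)
	modern, _ := UsesTwoPhaseCommit([]byte(`{"two_phase_job_commit":true}`), true)
	fmt.Println(legacy, modern) // false true

	acks := NewPendingAcks()
	fmt.Println(acks.Reserve(42, "runner-a")) // true
	fmt.Println(acks.Reserve(42, "runner-b")) // false
}
```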

The changes include new API responses (like HTTP 409 for conflicts), validation logic to handle the new workflow, and comprehensive documentation explaining the feature. This improvement will make job timing more accurate and reduce issues with jobs getting stuck when runners go offline during preparation.

Changelog: added

How to set up and validate locally

No released runner supports the required feature yet, so for now the only test possible with a stock runner is to confirm that the legacy workflow still works: create a job and check that the runner processes it normally.

To exercise the new workflow, build a runner that reports support for the feature by applying the following changes:

gitlab-runner patch
diff --git a/common/network.go b/common/network.go
index 12c604b32..96b1ed707 100644
--- a/common/network.go
+++ b/common/network.go
@@ -206,6 +206,7 @@ type FeaturesInfo struct {
 	ServiceExecutorOpts     bool `json:"service_executor_opts"`
 	CancelGracefully        bool `json:"cancel_gracefully"`
 	NativeStepsIntegration  bool `json:"native_steps_integration"`
+	TwoPhaseJobCommit       bool `json:"two_phase_job_commit"`
 }
 
 type ConfigInfo struct {
diff --git a/network/gitlab.go b/network/gitlab.go
index 090ec3840..607fa83cc 100644
--- a/network/gitlab.go
+++ b/network/gitlab.go
@@ -89,6 +89,7 @@ func (n *GitLabClient) getFeatures(features *common.FeaturesInfo) {
 	features.TraceSize = true
 	features.Cancelable = true
 	features.CancelGracefully = true
+	features.TwoPhaseJobCommit = true
 }
 
 func (n *GitLabClient) ExecutorSupportsNativeSteps(config common.RunnerConfig) bool {

Then enable the feature flag:

Feature.enable(:allow_runner_job_acknowledgement)

A full end-to-end test has been done in the follow-up MR.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
