Retry job if runner does not acknowledge it in a timely fashion
What does this MR do and why?
NOTE: This MR is a reinstatement of !204265 (merged) and !204709 (merged), which I believe were rolled back unnecessarily in !205960 (merged). The code paths in this MR should not touch Redis unless the FF is enabled and the runner supports the two_phase_job_commit feature.
NOTE: The first commit on this MR is a squash of !204265 (merged) + !204709 (merged). The subsequent commits contain the actual code that needs to be reviewed.
This MR:
- adds a worker that retries a job that is waiting for the runner acknowledgement (introduced in !204265 (merged)) if the acknowledgement doesn't arrive in a timely fashion. If the waiting period hasn't yet elapsed for the associated build, the worker reschedules itself to run again 30 seconds later.
- adds a new `runner_provisioning_timeout` failure reason to let users know why the job failed
- uses the existing retry logic for builds from `Ci::RetryJobService` and `Gitlab::Ci::Build::AutoRetry`, which provides a max limit on the number of retries.
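The behaviour described in the list above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code in this MR: `FakeBuild`, `AckTimeoutCheck`, and `MAX_RETRIES = 2` are hypothetical stand-ins. The 30-second re-check interval and the `runner_provisioning_timeout` reason come from this description, the 120-second wait matches the timeout used in the local validation steps, and the real retry limit is enforced by `Ci::RetryJobService` and `Gitlab::Ci::Build::AutoRetry`.

```ruby
# Hypothetical stand-in for a CI build; not the real Ci::Build model.
FakeBuild = Struct.new(:assigned_at, :acked, :retries, :failure_reason, keyword_init: true)

# Hypothetical decision logic for a worker that checks a pending acknowledgement.
module AckTimeoutCheck
  ACK_TIMEOUT = 120      # seconds the runner has to acknowledge the job
  RECHECK_INTERVAL = 30  # seconds until the worker looks at the build again
  MAX_RETRIES = 2        # assumption; the real limit comes from the existing retry logic

  # Returns [action, delay_in_seconds]. Times are plain integers (seconds).
  def self.call(build, now:)
    # The runner acknowledged the job, so there is nothing left to do.
    return [:noop, nil] if build.acked

    # Still inside the waiting period: check again later.
    return [:reschedule, RECHECK_INTERVAL] if now - build.assigned_at < ACK_TIMEOUT

    # Waiting period elapsed without an acknowledgement.
    build.failure_reason = :runner_provisioning_timeout
    build.retries < MAX_RETRIES ? [:retry, nil] : [:give_up, nil]
  end
end

build = FakeBuild.new(assigned_at: 0, acked: false, retries: 0, failure_reason: nil)
p AckTimeoutCheck.call(build, now: 45)
p AckTimeoutCheck.call(build, now: 120)
```

The first call falls inside the waiting period and asks for another check in 30 seconds; the second call happens once the period has elapsed, so the build is marked with the new failure reason and retried.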
I plan to monitor the Redis cluster usage as we roll out the FF across .com, to confirm that the usage introduced here doesn't cause issues.
References
- Introduce new API for Runner to transition jobs... (#464048 - closed)
- CI job will indicate it is running, but I don't... (#341293 - closed)
- Update `PUT /jobs/:id` endpoint for 2-phase com... (!204265 - merged)
Screenshots or screen recordings
Jobs started per 2-minute block (should give us a rough idea of how many Redis keys we'll have in memory, which is fewer than 5K).
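As a rough back-of-the-envelope check of that estimate: each key lives for at most 120 seconds (the TTL shown in the console step of the local validation), so the number of live keys at any moment is bounded by the number of jobs started in the preceding 2 minutes. The sketch below turns that into numbers. The helper names are hypothetical, and the 200-byte per-key size is purely an assumption for illustration, not a measured value.

```ruby
TTL_SECONDS = 120 # maximum lifetime of one acknowledgement key

# Upper bound on live keys: every job started within one TTL window may
# still hold a key. Uses integer ceiling division to avoid float rounding.
def max_live_keys(jobs_started, window_seconds:)
  (jobs_started * TTL_SECONDS + window_seconds - 1) / window_seconds
end

# Assumption: 200 bytes per key including Redis overhead (illustrative only).
def estimated_memory_bytes(keys, bytes_per_key: 200)
  keys * bytes_per_key
end

keys = max_live_keys(5_000, window_seconds: 120)
puts "keys: #{keys}, approx bytes: #{estimated_memory_bytes(keys)}"
```

Under these assumptions, 5,000 jobs per 2-minute block gives at most 5,000 live keys, or roughly 1 MB.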
How to set up and validate locally
Hacky patch used for the gitlab-runner repo to simulate the calls with the new API argument values
```diff
diff --git a/commands/multi.go b/commands/multi.go
index 33679afaf..547865a5c 100644
--- a/commands/multi.go
+++ b/commands/multi.go
@@ -1071,6 +1071,13 @@ func (mr *RunCommand) requestJob(
 		Token: jobData.Token,
 	}
 
+	jobInfo := common.UpdateJobInfo{
+		ID: jobData.ID,
+		State: common.Running,
+	}
+
+	mr.network.UpdateJob(*runner, jobCredentials, jobInfo)
+
 	trace, err := mr.network.ProcessJob(*runner, jobCredentials)
 	if err != nil {
 		jobInfo := common.UpdateJobInfo{
diff --git a/common/network.go b/common/network.go
index 12c604b32..96b1ed707 100644
--- a/common/network.go
+++ b/common/network.go
@@ -206,6 +206,7 @@ type FeaturesInfo struct {
 	ServiceExecutorOpts bool `json:"service_executor_opts"`
 	CancelGracefully bool `json:"cancel_gracefully"`
 	NativeStepsIntegration bool `json:"native_steps_integration"`
+	TwoPhaseJobCommit bool `json:"two_phase_job_commit"`
 }
 
 type ConfigInfo struct {
diff --git a/network/gitlab.go b/network/gitlab.go
index 090ec3840..607fa83cc 100644
--- a/network/gitlab.go
+++ b/network/gitlab.go
@@ -89,6 +89,7 @@ func (n *GitLabClient) getFeatures(features *common.FeaturesInfo) {
 	features.TraceSize = true
 	features.Cancelable = true
 	features.CancelGracefully = true
+	features.TwoPhaseJobCommit = true
 }
 
 func (n *GitLabClient) ExecutorSupportsNativeSteps(config common.RunnerConfig) bool {
```
1. Enable the FF:

   ```ruby
   Feature.enable(:allow_runner_job_acknowledgement)
   ```

2. Apply the patch above to the gitlab-runner project
3. From the GoLand IDE, put a breakpoint in `commands/multi.go:1079` - the call to `mr.network.UpdateJob(*runner, jobCredentials, jobInfo)`. This will prevent the runner from informing GitLab that the runner has picked up the job, and allow us to test a timeout.
4. Go to http://gdk.test:3000/gitlab-org and create a new project called `Playground`
5. Go to http://gdk.test:3000/gitlab-org/playground/-/settings/ci_cd#js-runners-settings and create a project runner with `two_phase_commit` tag
6. Register the runner with the `shell` executor and run it
7. Create a `.gitlab-ci.yml` file in the project with the following contents:

   ```yaml
   default:
     tags: [two_phase_commit]

   build1:
     stage: build
     script:
       - echo "Do your build here"
   ```

8. Run a new pipeline and specify a new variable `TEST_VAR` with value `test`
9. In the GDK console, run:

   ```ruby
   # Get pending build
   build = Ci::Build.pending.last
   # Should be true
   build.waiting_for_runner_ack?
   # Write id of runner manager who the build is waiting for
   build.runner_manager_id_waiting_for_ack
   # Write TTL for build (counting down from 120)
   Gitlab::Redis::SharedState.with { |r| r.ttl(build.send(:runner_build_ack_queue_key)) }
   # Check that build has custom variable
   build.variables["TEST_VAR"]
   ```

10. Let the build timeout after 2 minutes.
11. Remove the breakpoint in GoLand and resume execution

The retried job should now complete just fine.
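To make the TTL countdown from the console step easier to interpret, here is a minimal in-memory model of an expiring acknowledgement key. It is only a sketch: `FakeAckStore`, its method names, and the `"build:1"` key are hypothetical and not part of GitLab, and removing the key on acknowledgement is an assumption. The one real behaviour it mirrors is that the Redis `TTL` command returns the remaining seconds for a key and `-2` once the key no longer exists.

```ruby
# Hypothetical in-memory stand-in for the Redis key that tracks a pending
# runner acknowledgement. Times are plain integers (seconds).
class FakeAckStore
  def initialize
    @expires_at = {} # key => absolute expiry time
  end

  # Start waiting for an acknowledgement; the key expires after `ttl` seconds.
  def set(key, now:, ttl: 120)
    @expires_at[key] = now + ttl
  end

  # Assumption: an acknowledgement clears the key before it expires.
  # Returns true if a key was actually removed.
  def acknowledge(key)
    !@expires_at.delete(key).nil?
  end

  # Mirrors Redis TTL semantics: remaining seconds, or -2 when the key is gone.
  def ttl(key, now:)
    expiry = @expires_at[key]
    return -2 if expiry.nil? || expiry <= now

    expiry - now
  end
end

store = FakeAckStore.new
store.set("build:1", now: 0)
puts store.ttl("build:1", now: 30)  # still counting down
puts store.ttl("build:1", now: 120) # key has expired
```

With the breakpoint in place the key is never acknowledged, so repeated TTL reads count down until the key disappears, which is the point where the job is expected to fail and be retried.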
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

