Retry job if runner does not acknowledge it in a timely fashion

What does this MR do and why?

NOTE: This MR is a reinstatement of !204265 (merged) and !204709 (merged), which I believe were rolled back unnecessarily in !205960 (merged). The code paths in this MR should not touch Redis unless the FF is enabled and the runner supports the two_phase_job_commit feature.

NOTE: The first commit on this MR is a squash of !204265 (merged) + !204709 (merged). The subsequent commits contain the actual code that needs to be reviewed.

This MR:

  1. adds a worker that retries a job waiting for the runner acknowledgement (introduced in !204265 (merged)) if the acknowledgement doesn't arrive in a timely fashion. If the waiting period hasn't yet elapsed for the associated build, the worker reschedules itself to run again 30 seconds later.
  2. adds a new runner_provisioning_timeout failure reason, so that users can see why the job failed.
  3. reuses the existing retry logic for builds from Ci::RetryJobService and Gitlab::Ci::Build::AutoRetry, which caps the number of retries.
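The decision described in the list above can be modeled roughly as follows. This is a framework-free sketch, not the code in this MR: the method name `next_action`, its keyword arguments, and the returned symbols are all hypothetical, and the 120-second default is an assumption taken from the TTL shown in the validation steps below.

```ruby
# Hypothetical model of the worker's decision for one build.
# None of these names come from the MR; they only illustrate the flow.
def next_action(acknowledged:, seconds_waiting:, retries_so_far:, max_retries:,
                ack_timeout: 120, reschedule_delay: 30)
  # The runner acknowledged the job in time: nothing to do.
  return [:noop, nil] if acknowledged

  # The waiting period has not elapsed yet: check again later.
  return [:reschedule, reschedule_delay] if seconds_waiting < ack_timeout

  # The waiting period elapsed: fail the build, and retry only while
  # the retry cap has not been reached.
  if retries_so_far < max_retries
    [:fail_and_retry, :runner_provisioning_timeout]
  else
    [:fail, :runner_provisioning_timeout]
  end
end

# Example: 90 seconds into the wait, the check is simply pushed back.
p next_action(acknowledged: false, seconds_waiting: 90, retries_so_far: 0, max_retries: 2)
```

In the real implementation the retry cap is enforced by the existing retry services rather than by the worker itself; the sketch folds both into one function only to show the three possible outcomes side by side.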

I plan to monitor the Redis cluster usage as we roll out the FF across .com, to confirm that the usage introduced here doesn't cause issues.

References

Screenshots or screen recordings

Jobs started per 2-minute block (this should give us a rough idea of how many Redis keys we'll have in memory: fewer than 5K):

image

Grafana link

image
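The reasoning behind using the per-2-minute job counts as a proxy for Redis memory can be sketched as simple arithmetic. The helper below is hypothetical, and it assumes that each started job creates at most one key, which expires after a fixed TTL (120 seconds in the validation steps). The result is an upper bound, since keys for acknowledged jobs may be removed earlier.

```ruby
# Hypothetical helper: upper bound on keys resident in Redis at any instant.
# Assumes one key per started job and a fixed TTL per key.
def max_resident_keys(jobs_per_window:, window_seconds:, ttl_seconds:)
  # arrival rate (jobs/second) multiplied by key lifetime (seconds), rounded up
  (jobs_per_window * ttl_seconds).fdiv(window_seconds).ceil
end

# With a 2-minute window and a 2-minute TTL, the bound equals the window count.
puts max_resident_keys(jobs_per_window: 5_000, window_seconds: 120, ttl_seconds: 120) # => 5000
```

The 5,000 figure here is the ceiling quoted above, not a measured value.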

How to set up and validate locally

Hacky patch used for the gitlab-runner repo to simulate the calls with the new API argument values:
diff --git a/commands/multi.go b/commands/multi.go
index 33679afaf..547865a5c 100644
--- a/commands/multi.go
+++ b/commands/multi.go
@@ -1071,6 +1071,13 @@ func (mr *RunCommand) requestJob(
 		Token: jobData.Token,
 	}
 
+	jobInfo := common.UpdateJobInfo{
+		ID:    jobData.ID,
+		State: common.Running,
+	}
+
+	mr.network.UpdateJob(*runner, jobCredentials, jobInfo)
+
 	trace, err := mr.network.ProcessJob(*runner, jobCredentials)
 	if err != nil {
 		jobInfo := common.UpdateJobInfo{
diff --git a/common/network.go b/common/network.go
index 12c604b32..96b1ed707 100644
--- a/common/network.go
+++ b/common/network.go
@@ -206,6 +206,7 @@ type FeaturesInfo struct {
 	ServiceExecutorOpts     bool `json:"service_executor_opts"`
 	CancelGracefully        bool `json:"cancel_gracefully"`
 	NativeStepsIntegration  bool `json:"native_steps_integration"`
+	TwoPhaseJobCommit       bool `json:"two_phase_job_commit"`
 }
 
 type ConfigInfo struct {
diff --git a/network/gitlab.go b/network/gitlab.go
index 090ec3840..607fa83cc 100644
--- a/network/gitlab.go
+++ b/network/gitlab.go
@@ -89,6 +89,7 @@ func (n *GitLabClient) getFeatures(features *common.FeaturesInfo) {
 	features.TraceSize = true
 	features.Cancelable = true
 	features.CancelGracefully = true
+	features.TwoPhaseJobCommit = true
 }
 
 func (n *GitLabClient) ExecutorSupportsNativeSteps(config common.RunnerConfig) bool {
  1. Enable the FF:

    Feature.enable(:allow_runner_job_acknowledgement)
  2. Apply the patch above to the gitlab-runner project

  3. In the GoLand IDE, set a breakpoint at commands/multi.go:1079, on the call to mr.network.UpdateJob(*runner, jobCredentials, jobInfo). This prevents the runner from informing GitLab that it has picked up the job, which lets us test a timeout.

  4. Go to http://gdk.test:3000/gitlab-org and create a new project called Playground

  5. Go to http://gdk.test:3000/gitlab-org/playground/-/settings/ci_cd#js-runners-settings and create a project runner with the two_phase_commit tag

  6. Register the runner with the shell executor and run it

  7. Create a .gitlab-ci.yml file in the project with the following contents:

    default:
      tags: [two_phase_commit]
    
    build1:
      stage: build
      script:
        - echo "Do your build here"
  8. Run a new pipeline, adding a new variable TEST_VAR with the value test

  9. In the GDK console, run:

    # Get pending build
    build = Ci::Build.pending.last
    # Should be true
    build.waiting_for_runner_ack?
    # Print the ID of the runner manager that the build is waiting on
    build.runner_manager_id_waiting_for_ack
    # Print the remaining TTL for the build (counting down from 120 seconds)
    Gitlab::Redis::SharedState.with { |r| r.ttl(build.send(:runner_build_ack_queue_key)) }
    # Check that the build has the custom variable
    build.variables["TEST_VAR"]
  10. Let the build time out after 2 minutes:

    image

  11. Remove the breakpoint in GoLand and resume execution

The retried job should now complete just fine.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Pedro Pombeiro
