Start pipeline in after_commit callback when retrying jobs

Stan Hu requested to merge sh-retry-job-start-pipeline-after-commit into master

What does this MR do and why?

This fixes an issue where concurrent runners would not pick up retried builds because the runner tick value was not invalidated.

Previously, when a job failed, Ci::RetryJobService would clone the failed job, add its own after_commit hooks, and immediately attempt to start the pipeline. However, starting a pipeline loads its own instances of Ci::Build, and for builds that are updated, the state changes add their own after_commit hooks to those separate instances.

Within a given transaction, Rails runs the after_commit hooks of only one in-memory instance of a given record (see https://github.com/rails/rails/pull/45280), so the after_commit hooks added when starting the build were discarded. As a result, BuildQueueWorker was never executed for the build, leaving the runner tick value stale. Other run_after_commit blocks, such as build hooks, were not executed either.

To avoid this double after_commit business, move the starting of the pipeline into the run_after_commit block of the cloned job. This slightly delays the starting of the pipeline and the job, but it also avoids starting a Sidekiq worker inside a transaction (#398229 (closed)).
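The difference between the old and new ordering can be sketched with a toy model in plain Ruby. This is only an illustration under stated assumptions: ToyTransaction, ToyBuild, start_pipeline, and the two retry_* methods are hypothetical names invented here, not Rails or GitLab code, and the transaction simply keeps the first instance it sees for each record id to imitate the behaviour described above.

```ruby
# Toy model only: these classes imitate the behaviour described above and are
# not Rails or GitLab code.
class ToyTransaction
  def initialize
    @enlisted = {} # record id => first in-memory instance seen
  end

  # Only the first instance of a given record is kept; later ones are ignored.
  def enlist(instance)
    @enlisted[instance.id] ||= instance
  end

  # On commit, run the queued callbacks of the kept instances.
  def commit
    @enlisted.each_value(&:run_commit_callbacks)
  end
end

class ToyBuild
  attr_reader :id

  def initialize(id)
    @id = id
    @callbacks = []
  end

  def run_after_commit(&block)
    @callbacks << block
  end

  def run_commit_callbacks
    @callbacks.each(&:call)
  end
end

# Stand-in for starting the pipeline: it loads its own instance of the build,
# whose state change queues the hook that would enqueue BuildQueueWorker.
def start_pipeline(txn, build_id, log)
  reloaded = ToyBuild.new(build_id)
  reloaded.run_after_commit { log << :build_queue_worker }
  txn.enlist(reloaded)
end

# Old ordering: the pipeline is started inside the retry transaction.
def retry_inside_transaction
  log = []
  txn = ToyTransaction.new
  cloned = ToyBuild.new(1)
  cloned.run_after_commit { log << :retry_hook }
  txn.enlist(cloned)
  start_pipeline(txn, cloned.id, log) # second instance, same transaction
  txn.commit
  log
end

# New ordering: the pipeline is started from the cloned job's after-commit
# block, so the reloaded instance lives in a later transaction of its own.
def retry_after_commit
  log = []
  txn = ToyTransaction.new
  cloned = ToyBuild.new(1)
  cloned.run_after_commit do
    log << :retry_hook
    later = ToyTransaction.new
    start_pipeline(later, cloned.id, log)
    later.commit
  end
  txn.enlist(cloned)
  txn.commit
  log
end
```

In this sketch, retry_inside_transaction records only :retry_hook because the reloaded instance's callback is dropped, while retry_after_commit records both :retry_hook and :build_queue_worker.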

Relates to #387775 (closed)

How to set up and validate locally

  1. Set up a runner with concurrent = 2.
  2. Set up a project with a .gitlab-ci.yml:

     image: busybox:latest

     stages:
       - test

     test:
       stage: test
       script:
         - exit 2
       retry: 1

  3. Run a pipeline, watch it fail. The second job should remain in pending.
  4. Enable the feature flag: Feature.enable(:retry_job_start_pipeline_after_commit).
  5. Run the pipeline again, and the job should be retried.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Stan Hu