Reset skipped jobs on new alive jobs during pipeline processing

Leaminn Ma requested to merge reset-reprocess-in-atomic-processing into master

What does this MR do and why?

As discovered in #388539 (closed), a race condition sometimes occurs during pipeline processing, which results in some dependent jobs remaining skipped after clicking the "Play all manual" button. A simplified visual workflow of what is happening can be found here.

In this MR, we fix this race condition by re-running ResetSkippedJobsService whenever a job changes from a "stopped" status to an "alive" status during pipeline processing. The status definitions can be found in app/models/concerns/ci/has_status.rb.

Feature Flag: ci_reset_skipped_jobs_in_atomic_processing
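
The trigger condition described above can be sketched in plain Ruby. This is a minimal illustration, not the code in this MR: the status lists are assumed for the example (the authoritative ones are in app/models/concerns/ci/has_status.rb), and `stopped_to_alive?` and `jobs_needing_reset` are hypothetical helper names.

```ruby
# Illustrative status groups. These lists are an assumption for this sketch;
# the authoritative definitions live in app/models/concerns/ci/has_status.rb.
STOPPED_STATUSES = %w[success failed canceled skipped manual scheduled].freeze
ALIVE_STATUSES   = %w[created waiting_for_resource preparing pending running].freeze

# True when a job moved from a "stopped" status to an "alive" one.
def stopped_to_alive?(old_status, new_status)
  STOPPED_STATUSES.include?(old_status) && ALIVE_STATUSES.include?(new_status)
end

# Hypothetical helper: given { job_name => [old_status, new_status] } and the
# feature-flag state, returns the jobs the reset service would be re-run on.
def jobs_needing_reset(transitions, flag_enabled:)
  return [] unless flag_enabled

  transitions.select { |_job, (from, to)| stopped_to_alive?(from, to) }.keys
end

transitions = {
  "manual-job-1" => %w[manual pending],  # played: stopped -> alive
  "manual-job-2" => %w[manual manual],   # not played yet: no change
  "job-1"        => %w[created skipped]  # alive -> stopped: not a trigger
}

p jobs_needing_reset(transitions, flag_enabled: true)
p jobs_needing_reset(transitions, flag_enabled: false)
```

With the flag off, the sketch selects nothing, which mirrors the feature flag guarding the new behavior.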

How to set up and validate locally

The following steps present a scenario with a high likelihood of reproducing the issue.

  1. First, check out this branch and amend the codebase with a couple of sleeps. The goal is to ensure that AtomicProcessingService starts running just after manual-job-1 is played and before manual-job-2 is played.

Add sleep(0.5) to app/services/ci/play_manual_stage_service.rb:15:

Screenshot_2023-05-12_at_11.16.21_AM

Add sleep(1) to app/services/ci/pipeline_processing/atomic_processing_service.rb:94:

Screenshot_2023-05-12_at_11.16.04_AM

Note: If the above sleep times don't reliably reproduce the error, try increasing them to 1.5 and 2 seconds. These are the values that worked on my local machine.

  2. Update your project's .gitlab-ci.yml file with the following contents:

stages:
  - build
  - test

manual-job-1:
  stage: build
  when: manual
  script: echo

manual-job-2:
  stage: build
  when: manual
  script: echo

job-1:
  stage: test
  needs: [manual-job-1, manual-job-2]
  script: echo

  3. Run the pipeline. Initially it should look like the following screenshot. (Note that the initial processing may take a few seconds longer because of the sleeps we added.)

Screenshot_2023-05-12_at_11.00.16_AM

  4. Now click the "Play all manual" button of the build stage. The result is shown below:

image

In the above screenshot, observe that job-1 is in skipped status.

  5. Now enable the feature flag in a Rails console: Feature.enable(:ci_reset_skipped_jobs_in_atomic_processing)
  6. Repeat steps 3-4 (run the pipeline, then click "Play all manual") and observe that the issue does not occur: job-1 goes to created status and eventually succeeds. Repeat the test several times to ensure the pipeline reliably succeeds.

Screenshot_2023-05-12_at_11.06.28_AM
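
The reason job-1 is the job affected in this scenario can be read off the test config: it needs both manual jobs. A small sketch that extracts this from the .gitlab-ci.yml used in the steps above, assuming the config is inlined as a string. `manual_needs` is a hypothetical helper, not part of GitLab.

```ruby
require "yaml"

# The CI config from the steps above, inlined so the sketch is self-contained.
CI_CONFIG = <<~YAML
  stages:
    - build
    - test

  manual-job-1:
    stage: build
    when: manual
    script: echo

  manual-job-2:
    stage: build
    when: manual
    script: echo

  job-1:
    stage: test
    needs: [manual-job-1, manual-job-2]
    script: echo
YAML

# Hypothetical helper: maps each job to the manual jobs listed in its `needs`.
def manual_needs(config)
  # Job definitions are the top-level hash entries; `stages` is a list.
  jobs = config.select { |_name, body| body.is_a?(Hash) }
  manual = jobs.select { |_name, body| body["when"] == "manual" }.keys

  jobs.each_with_object({}) do |(name, body), result|
    needed = Array(body["needs"]) & manual
    result[name] = needed unless needed.empty?
  end
end

puts manual_needs(YAML.safe_load(CI_CONFIG)).inspect
```

Because job-1 waits on two manual jobs that are played one after the other, a processing pass can run between the two plays, which is the window the added sleeps are meant to widen.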

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #388539 (closed)

Edited by Leaminn Ma
