Skip to content

Draft: Fix stuck skipped jobs after playing a manual job

Furkan Ayhan requested to merge 341561-skipped-after-running-manual into master

What does this MR do and why?

This MR fixes the problem when a manual job is played and some subsequent jobs are stuck at "skipped" status.

Related to #341561 (closed)

When playing a manual job, we enqueue PipelineProcessWorker in a few places:

  1. Ci::PlayBuildService for the played job (build.enqueue triggers after_transition)
  2. Ci::AfterRequeueJobService for each subsequent skipped job (processable.process triggers after_transition)

In this commit, we skip pipeline processing for all by using "skip_pipeline_processing", and enqueue a single PipelineProcessWorker manually.

Why?

With the following examples in the section below, you cannot always reproduce the bug in the production environment. This is a sign of a race-condition or concurrency problem. I tried those examples in my local environment and saw that PipelineProcessWorker was enqueued more than once redundantly, sometimes.


Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

Screen_Shot_2021-11-05_at_13.58.06

When you play the job A:

Screen_Shot_2021-11-05_at_13.58.25

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
  2. When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
  5. When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

Screen_Shot_2021-11-05_at_13.59.03

Screenshots or screen recordings / How to set up and validate locally

Example 1 of the bug: "Steps to reproduce" section of #341561 (closed)
diff --git a/app/services/ci/after_requeue_job_service.rb b/app/services/ci/after_requeue_job_service.rb
index 9101ae91967..742bd4e3e54 100644
--- a/app/services/ci/after_requeue_job_service.rb
+++ b/app/services/ci/after_requeue_job_service.rb
@@ -13,6 +13,7 @@ def process_subsequent_jobs(processable)
       (stage_dependent_jobs(processable) | needs_dependent_jobs(processable))
       .each do |processable|
         process(processable)
+        sleep 1
       end
     end

diff --git a/app/services/ci/play_build_service.rb b/app/services/ci/play_build_service.rb
index c1cf06a4631..ba0f3992ad9 100644
--- a/app/services/ci/play_build_service.rb
+++ b/app/services/ci/play_build_service.rb
@@ -8,6 +8,8 @@ def execute(build, job_variables_attributes = nil)
       # Try to enqueue the build, otherwise create a duplicate.
       #
       if build.enqueue
+        sleep 1
+
         build.tap do |build|
           build.update(user: current_user, job_variables_attributes: job_variables_attributes || [])
stages:
  - first
  - second
  - third

first-a:
  stage: first
  script: sleep 10
  when: manual

first-b:
  stage: first
  script: sleep 10
  when: manual

second-a:
  stage: second
  script: echo
  needs: [first-a]

second-b:
  stage: second
  script: echo
  needs: [first-b]

third-a:
  stage: third
  script: echo
  needs:
    - first-a
    - second-a
  when: manual

third-b:
  stage: third
  script: echo
  needs:
    - first-b
    - second-b
  when: manual

Click "Play All";

Screen_Shot_2021-10-25_at_22.24.13

You'll see this;

Screen_Shot_2021-10-25_at_23.14.59

Then this;

Screen_Shot_2021-10-25_at_23.19.23

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan

Merge request reports