Skip to content

Fix DAG order of subsequent jobs after requeue

Furkan Ayhan requested to merge 341561-fix-dag-order into master

What does this MR do and why?

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

We recently fixed this problem for 99% of occurrences: !77528 (merged), however, the problem still exists for the same-stage pipelines. Example use-case: #217129 (comment 826686713)

In this MR, we are ordering jobs by stage first, then their DAG relationships.

We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))

This bug fix is behind a FF ci_fix_order_of_subsequent_jobs #345587 (closed).

Previous attempt: !74394 (closed)

Screenshots or screen recordings

Example CI config:

stages: [a, b, c]

a1:
  stage: a
  script: exit $(($RANDOM % 2))

a2:
  stage: a
  script: exit 0
  needs: [a1]

b1:
  stage: b
  script: exit 0
  needs: [b2]

b2:
  stage: b
  script: exit 0

c1:
  stage: c
  script: exit 0
  needs: [b2]

c2:
  stage: c
  script: exit 0

Initial status:

Screen_Shot_2022-02-21_at_12.18.10

When you retry a1:

Screen_Shot_2022-02-21_at_13.01.24

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
  2. Those jobs are in random order if they are in the same stage.
  3. When we retry "a1...", Ci::AfterRequeueJobService sets subsequent jobs as "created" one by one.
  4. After setting a job as "created", we enqueue a PipelineProcessWorker.
  5. In our race-condition case, that worker is started to be processed before even iterating other jobs.
  6. Summary; retry a1, set b1, PipelineProcessWorker runs, then set b2.
  7. When we calculate the status of b1 in AtomicProcessing, we use the status of b2, which is "skipped" at the moment. So, we mark b1 as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  8. And b1 stays in the "skipped" state forever, even after processing b2.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #341561 (closed)

Edited by Furkan Ayhan

Merge request reports