Skip to content

Draft: Fix the order of subsequent jobs when requeuing a job (in DB)

Furkan Ayhan requested to merge 341561-order-of-processed-jobs-with-sql into master

What does this MR do and why?

This MR is just an alternative PoC for #341561 (closed).

In other MR (!74394 (closed)), we tried solving the ordering problem on the Ruby side, in this, on the DB side.

Copied from the other MR:

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

In this MR, we are ordering jobs by stage first, then their DAG relationships.

We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))

This bug fix is behind a FF ci_fix_order_of_subsequent_jobs #345587 (closed).

DB

New SQL query;

SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN (
  SELECT id, MAX(depth) AS depth
  FROM (
    WITH RECURSIVE "base_and_descendants" AS (
      (SELECT 1 AS depth, ARRAY[ci_builds.id] AS tree_path, false AS tree_cycle, "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
      UNION
      (SELECT ("base_and_descendants"."depth" + 1), tree_path || "ci_builds".id, "ci_builds".id = ANY(tree_path), "ci_builds".*
       FROM "ci_builds", "base_and_descendants", "ci_build_needs"
       WHERE "ci_builds"."commit_id" = "base_and_descendants"."commit_id"
         AND "ci_build_needs"."build_id" = "ci_builds"."id"
         AND "ci_build_needs"."name" = "base_and_descendants"."name"
         AND "base_and_descendants"."tree_cycle" = FALSE)
    )
    SELECT ci_builds.id, MAX(depth) AS depth
    FROM "base_and_descendants" AS "ci_builds"
    WHERE "ci_builds"."id" NOT IN (SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
    GROUP BY "ci_builds"."id"

    UNION

    SELECT ci_builds.id, -1 AS depth FROM "ci_builds" WHERE "ci_builds"."commit_id" = 386704131 AND (stage_idx > 1)
  ) union_ci_builds
  GROUP BY id
) t2
ON ci_builds.id = t2.id
WHERE ("ci_builds"."status" IN ('skipped'))
ORDER BY stage_idx ASC, depth ASC, id ASC;

How to set up and validate locally

Bug description:

Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

Screen_Shot_2021-11-05_at_13.58.06

When you play the job A:

Screen_Shot_2021-11-05_at_13.58.25

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
  2. When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
  5. When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

Screen_Shot_2021-11-05_at_13.59.03

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan

Merge request reports