Fix the order of subsequent jobs when requeuing a job

Merged Furkan Ayhan requested to merge 341561-order-by-stage into master

What does this MR do and why?

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

In this MR, we are ordering jobs by stage to fix this. This will not fix all problems because there may be same-stage relationships but it is enough to fix 99% of occurrences.

In !74394 (closed), we tried to solve it for all, but it has lots of changes.

Bug issue: #341561 (closed)

This fix is behind a feature flag ci_order_subsequent_jobs_by_stage #349977 (closed).

Screenshots or screen recordings

Bug description:

Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

Screen_Shot_2021-11-05_at_13.58.06

When you play the job A:

Screen_Shot_2021-11-05_at_13.58.25

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
  2. When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
  5. When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

Screen_Shot_2021-11-05_at_13.59.03

Database

Old SQL Queries;

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND (stage_idx > 0);

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28100

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND "ci_builds"."scheduling_type" = 1
  AND (
    EXISTS (
      SELECT 1
      FROM "ci_build_needs"
      WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
    )
  );

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28102

New SQL Query;

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND (
    stage_idx > 0
    OR "ci_builds"."scheduling_type" = 1
    AND (
      EXISTS (
        SELECT 1
        FROM "ci_build_needs"
        WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
      )
    )
  )
ORDER BY "ci_builds"."stage_idx" ASC;

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28104

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan