Skip to content

Fix the order of subsequent jobs when requeuing a job

Furkan Ayhan requested to merge 341561-order-by-stage into master

What does this MR do and why?

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

In this MR, we are ordering jobs by stage to fix this. This will not fix all problems because there may be same-stage relationships but it is enough to fix 99% of occurrences.

In !74394 (closed), we tried to solve it for all, but it has lots of changes.

Bug issue: #341561 (closed)

This fix is behind a feature flag ci_order_subsequent_jobs_by_stage #349977 (closed).

Screenshots or screen recordings

Bug description:

Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

Screen_Shot_2021-11-05_at_13.58.06

When you play the job A:

Screen_Shot_2021-11-05_at_13.58.25

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
  2. When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
  5. When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

Screen_Shot_2021-11-05_at_13.59.03

Database

Old SQL Queries;

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND (stage_idx > 0);

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28100

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND "ci_builds"."scheduling_type" = 1
  AND (
    EXISTS (
      SELECT 1
      FROM "ci_build_needs"
      WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
    )
  );

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28102

New SQL Query;

SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND (
    stage_idx > 0
    OR "ci_builds"."scheduling_type" = 1
    AND (
      EXISTS (
        SELECT 1
        FROM "ci_build_needs"
        WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
      )
    )
  )
ORDER BY "ci_builds"."stage_idx" ASC;

-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28104

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan

Merge request reports