Fix the order of subsequent jobs when requeuing a job
What does this MR do and why?
When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx
but after introducing same-stage jobs, this disappeared.
In this MR, we are ordering jobs by stage to fix this. This will not fix all problems because there may be same-stage relationships but it is enough to fix 99% of occurrences.
In !74394 (closed), we tried to solve it for all, but it has lots of changes.
Bug issue: #341561 (closed)
This fix is behind a feature flag ci_order_subsequent_jobs_by_stage
#349977 (closed).
Screenshots or screen recordings
Bug description:
Example CI config:
stages:
- A
- B
- C
A:
stage: A
script: exit 0
when: manual
B:
stage: B
needs: ["A"]
script: exit 0
C1:
stage: C
needs: ["B"]
script: exit 0
C2:
stage: C
needs: ["B"]
script: exit 0
C3:
stage: C
needs: ["B"]
script: exit 0
Initial status:
When you play the job A:
- In
Ci::AfterRequeueJobService
; dependent jobs are iterated and run theprocess
method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb
(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
job.process(current_user)
end
- Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
- When we process "C2", its status will be "created". And we enqueue a
PipelineProcessWorker
right after it. - In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
-
PipelineProcessWorker
callsCi::PipelineProcessing::AtomicProcessingService
andupdate_processable!
is called for "C2". - When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
- And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".
Database
Old SQL Queries;
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
AND "ci_builds"."commit_id" = 383904039
AND ("ci_builds"."status" IN ('skipped'))
AND (stage_idx > 0);
-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28100
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
AND "ci_builds"."commit_id" = 383904039
AND ("ci_builds"."status" IN ('skipped'))
AND "ci_builds"."scheduling_type" = 1
AND (
EXISTS (
SELECT 1
FROM "ci_build_needs"
WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
)
);
-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28102
New SQL Query;
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Processable', 'Ci::Build', 'Ci::Bridge')
AND "ci_builds"."commit_id" = 383904039
AND ("ci_builds"."status" IN ('skipped'))
AND (
stage_idx > 0
OR "ci_builds"."scheduling_type" = 1
AND (
EXISTS (
SELECT 1
FROM "ci_build_needs"
WHERE (ci_builds.id=ci_build_needs.build_id) AND "ci_build_needs"."name" = 'trigger_job'
)
)
)
ORDER BY "ci_builds"."stage_idx" ASC;
-- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7867/commands/28104
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.