Skip to content

Draft: Fix the order of subsequent same-stage jobs when requeuing a job

Furkan Ayhan requested to merge 341561-order-of-processed-jobs into master

What does this MR do and why?

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

We recently fixed this problem for 99% of occurrences: !77528 (merged), however, the problem still exists for the same-stage pipelines. Example use-case: #217129 (comment 826686713)

In this MR, we are ordering jobs by stage first, then their DAG relationships.

We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))

  • This bug fix is behind a FF ci_fix_order_of_subsequent_jobs #345587 (closed).
  • This MR adds a new column processing_order to the ci_builds_metadata table.

Note: We have two options for this fix; 1. doing the sorting on the Ruby side, 2. doing the sorting on the DB side. I chose the first one for this MR. And, I'll create another MR for the second option. (just did: !74491 (closed))

DB

UP

== 20220201092453 AddProcessingOrderToCiBuildsMetadata: migrating =============
-- add_column(:ci_builds_metadata, :processing_order, :integer)
   -> 0.0024s
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: migrated (0.0024s) ====

DOWN

== 20220201092453 AddProcessingOrderToCiBuildsMetadata: reverting =============
-- remove_column(:ci_builds_metadata, :processing_order, :integer)
   -> 0.0020s
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: reverted (0.0035s) ====

New query:

SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "ci_builds_metadata" ON "ci_builds_metadata"."build_id" = "ci_builds"."id"
WHERE "ci_builds"."type" IN ('Ci::Processable',
                             'Ci::Build',
                             'Ci::Bridge')
  AND "ci_builds"."commit_id" = 383904039
  AND ("ci_builds"."status" IN ('skipped'))
  AND (stage_idx > 0
       OR "ci_builds"."scheduling_type" = 1
       AND (EXISTS
              (SELECT 1
               FROM "ci_build_needs"
               WHERE (ci_builds.id=ci_build_needs.build_id)
                 AND "ci_build_needs"."name" = 'build1')))
ORDER BY "ci_builds_metadata"."processing_order" ASC;

Query performance

ALTER TABLE "ci_builds_metadata" ADD "processing_order" integer DEFAULT 0 NOT NULL;

cold: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7511/commands/26727
hot: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7511/commands/26728

Screenshots or screen recordings

Bug description:

Example CI config:

default:
  image: alpine

test1:
  script: exit 0
  when: manual

test2:
  needs: ["test1"]
  script: exit 0
  parallel: 20

test3:
  needs: ["test2"]
  script: exit 0
  parallel: 20

Initial status:

Screen_Shot_2022-02-01_at_14.36.52

When you play the job A:

Screen_Shot_2022-02-01_at_14.38.01

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/84ff263c6a740b5ddf0202166c1cc085a42184a6/app/services/ci/after_requeue_job_service.rb#L26-28

stage_dependent_jobs.or(needs_dependent_jobs).ordered_by_stage.each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order if they are in the same stage. In our example, the order is: "test3..., test2...".
  2. When we process "test3...", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "test2..."
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "test3...".
  5. When we calculate the status of "test3...", we use the status of "test2...", which is "skipped" at the moment. So, we mark "test3..." as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "test3..." stays in the "skipped" state forever, even after processing "test2...".

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan

Merge request reports