Use until_executing when deduplicating the "assign resource worker"
What does this MR do and why?
Jobs with a resource group sometimes get stuck "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)
This problem is due to the fact that the `AssignResourceFromResourceGroupWorker`, which assigns a resource from a resource group to a job, is deduplicated with an `until_executed` strategy. This means that if a worker for a resource group is still running, a newly queued worker for the same resource group gets dropped.
We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, upon verifying in production, we observed that !147313 (merged) did not really fix the problem because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).
The change
This MR tackles the actual cause of the problem, which is: jobs being dropped if another job for the same resource group is RUNNING. Here, we change the deduplication strategy to `until_executing`, which means that jobs will be dropped only if another job for the same resource group is QUEUED; if the job is already running, new jobs can be queued. I believe that this change, in combination with the first fix, will prevent the possibility of jobs getting stuck at "waiting for resource".
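To illustrate the difference between the two strategies, here is a minimal, self-contained Ruby model of the deduplication behavior. This is only an illustrative sketch, not GitLab's actual duplicate-jobs Sidekiq middleware; the class and method names are made up for this example:

```ruby
require 'set'

# Toy model of Sidekiq job deduplication. A lock is taken when a job is
# enqueued; the strategy decides when the lock is released.
class FakeQueue
  def initialize(strategy)
    @strategy = strategy # :until_executed or :until_executing
    @locks = Set.new
    @accepted = []
  end

  attr_reader :accepted

  # Returns true if the job was enqueued, false if deduplicated (dropped).
  def enqueue(key)
    return false if @locks.include?(key)

    @locks << key
    @accepted << key
    true
  end

  # :until_executing releases the lock when the job STARTS running,
  # so jobs enqueued during execution are accepted.
  def start(key)
    @locks.delete(key) if @strategy == :until_executing
  end

  # :until_executed holds the lock until the job FINISHES,
  # so jobs enqueued during execution are dropped.
  def finish(key)
    @locks.delete(key) if @strategy == :until_executed
  end
end
```

Under `:until_executed`, an enqueue attempt between `start` and `finish` returns `false` (this is the dropped "assign resource" job that leaves builds stuck); under `:until_executing`, the same attempt succeeds.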
Caveats and considerations
This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.

- The change is instead introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes.
- It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is an exact copy of `AssignResourceFromResourceGroupWorker`, except it has a deduplication strategy of `until_executing`. (FF rollout issue: #460793 (closed))
- Switching between the new worker and the old worker when enabling/disabling the feature flag should be okay. See: #460793 (closed)
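Since the strategy itself can't be toggled by a flag, the switch has to happen at the call site by choosing which worker class to enqueue. The following is a hypothetical sketch of that dispatch; the worker class names come from this MR, but the helper name is made up, and `Feature` is stubbed here so the sketch runs standalone (in GitLab the real `Feature.enabled?` check would be used):

```ruby
# Stub of GitLab's Feature module, for illustration only.
module Feature
  FLAGS = { assign_resource_worker_deduplicate_until_executing: true }

  def self.enabled?(name)
    FLAGS.fetch(name, false)
  end
end

# Hypothetical helper: returns the name of the worker class to enqueue,
# depending on the feature flag state.
def assign_resource_worker_class
  if Feature.enabled?(:assign_resource_worker_deduplicate_until_executing)
    'Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker'
  else
    'Ci::ResourceGroups::AssignResourceFromResourceGroupWorker'
  end
end
```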
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
N/A. See validation steps.
How to set up and validate locally
The problem is impossible to replicate locally, but we can instead make sure that this change does not introduce any errors. We can also make sure that the correct workers are called depending on the status of the feature flag.
Setup
- Create a project
- Add a `.gitlab-ci.yml` and a child `.deploy.yml` pipeline configuration:

`.gitlab-ci.yml`

```yaml
build:
  stage: build
  script: echo "building stuff"

deploy_a:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_a_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

# you can delete all the other deploy triggers below for faster testing
deploy_b:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_b_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_c:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_c_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_d:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_d_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_e:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_e_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend
```

`.deploy.yml`

```yaml
deploy1:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY"
  environment:
    name: production
    action: start

deploy2:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY2"
  environment:
    name: production2
    action: start

deploy3:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY3"
  environment:
    name: production3
    action: start

deploy4:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY4"
  environment:
    name: production4
    action: start

deploy5:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY5"
  environment:
    name: production5
    action: start
```
- (Optional) Enable log level `:info` for your development environment by editing the `config/environments/development.rb` file and adding a `config.log_level = :info` line.
- In two different terminal windows, run the following in your GDK directory:

```shell
# to check logs for AssignResourceFromResourceGroupWorker
gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'

# to check logs for NewAssignResourceFromResourceGroupWorker
gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'
```
Testing

- With the `assign_resource_worker_deduplicate_until_executing` feature flag disabled, run the pipeline a couple of times and verify that `AssignResourceFromResourceGroupWorker` is being called:
  - In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that `AssignResourceFromResourceGroupWorker` jobs are being run.
  - If you have enabled log level `:info`, verify that:
    - logs for `AssignResourceFromResourceGroupWorker` are showing
    - logs for `NewAssignResourceFromResourceGroupWorker` are NOT showing
- With the `assign_resource_worker_deduplicate_until_executing` feature flag enabled, run the pipeline a couple of times and verify that `NewAssignResourceFromResourceGroupWorker` is being called:
  - In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that `NewAssignResourceFromResourceGroupWorker` jobs are being run.
  - If you have enabled log level `:info`, verify that:
    - logs for `AssignResourceFromResourceGroupWorker` are NOT showing
    - logs for `NewAssignResourceFromResourceGroupWorker` are showing
Related to #436988 (closed)
Merge request reports
Activity
changed milestone to %17.1
assigned to @partiaga
- A deleted user added backend feature flag labels
- Resolved by Pam Artiaga
1 Warning: 757efdfc: Commits that change 30 or more lines across at least 3 files should describe these changes in the commit body. For more information, take a look at our Commit message guidelines.

1 Message: CHANGELOG missing: If this merge request needs a changelog entry, add the `Changelog` trailer to the commit message you want to add to the changelog. If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.
Reviewer roulette
| Category | Reviewer | Maintainer |
| --- | --- | --- |
| backend | @bhrai (UTC+2, same timezone as author) | @wandering_person (UTC+7, 5 hours ahead of author) |
| ~"Verify" | Reviewer review is optional for ~"Verify" @drew (UTC+0, 2 hours behind author) | |
Please check reviewer's status!
Please refer to documentation page for guidance on how you can benefit from the Reviewer Roulette, or use the GitLab Review Workload Dashboard to find other available reviewers.
Sidekiq queue changes
This merge request contains changes to Sidekiq queues. Please follow the documentation on changing a queue's urgency.
These queues were added:
pipeline_processing:ci_resource_groups_assign_resource_from_resource_group_worker_v2
If needed, you can retry the `danger-review` job that generated this comment.

Generated by Danger

mentioned in issue #408112
- Resolved by Pam Artiaga
- Resolved by Max Fan
Hi @hustewart, would you mind giving this one the initial review since you already have context for this?
I believe that the most correct solution for this problem is actually to introduce a way in Sidekiq to allow jobs to be run one-at-a-time without dropping queued jobs at all. It's similar to the Sidekiq Unique Jobs mechanism, except we have to build it ourselves.
Before doing that, I would like to try this solution, which is to "relax" the deduplication strategy from `until_executed` to `until_executing`.

I'd like your thoughts/feedback on this one, particularly whether this makes sense conceptually. If all is good, we can pull in @shinya.maeda as the maintainer reviewer.
requested review from @hustewart
added pipeline:mr-approved label
- Resolved by Max Fan
@hustewart, thanks for approving this merge request.

This is the first time the merge request has been approved. To ensure we don't only run predictive pipelines, and we don't break `master`, a new pipeline will be started shortly.

Please wait for the pipeline to start before resolving this discussion and set auto-merge for the new pipeline. See merging a merge request for more details.
reset approvals from @hustewart by pushing to the branch
E2E Test Result Summary
allure-report-publisher generated test report!

e2e-test-on-gdk: test report for 757efdfc
```
+-------------+--------+--------+---------+-------+-------+--------+
|                          suites summary                          |
+-------------+--------+--------+---------+-------+-------+--------+
|             | passed | failed | skipped | flaky | total | result |
+-------------+--------+--------+---------+-------+-------+--------+
| Data Stores | 62     | 0      | 0       | 44    | 62    | ✅     |
| Monitor     | 16     | 0      | 0       | 14    | 16    | ✅     |
| Create      | 222    | 0      | 18      | 188   | 240   | ✅     |
| Verify      | 62     | 0      | 2       | 60    | 64    | ✅     |
| Govern      | 132    | 0      | 0       | 86    | 132   | ✅     |
| Manage      | 0      | 0      | 2       | 0     | 2     | ➖     |
| Plan        | 108    | 0      | 4       | 94    | 112   | ✅     |
| Package     | 38     | 0      | 24      | 38    | 62    | ✅     |
| Release     | 10     | 0      | 0       | 10    | 10    | ✅     |
| Analytics   | 4      | 0      | 0       | 2     | 4     | ✅     |
+-------------+--------+--------+---------+-------+-------+--------+
| Total       | 654    | 0      | 50      | 536   | 704   | ✅     |
+-------------+--------+--------+---------+-------+-------+--------+
```
e2e-package-and-test: test report for 757efdfc
```
+-------------+--------+--------+---------+-------+-------+--------+
|                          suites summary                          |
+-------------+--------+--------+---------+-------+-------+--------+
|             | passed | failed | skipped | flaky | total | result |
+-------------+--------+--------+---------+-------+-------+--------+
| Govern      | 178    | 0      | 8       | 2     | 186   | ✅     |
| Create      | 569    | 0      | 69      | 0     | 638   | ✅     |
| Data Stores | 79     | 0      | 18      | 0     | 97    | ✅     |
| Package     | 123    | 0      | 62      | 0     | 185   | ✅     |
| Plan        | 183    | 0      | 14      | 0     | 197   | ✅     |
| Fulfillment | 4      | 0      | 50      | 0     | 54    | ✅     |
| Verify      | 105    | 0      | 20      | 0     | 125   | ✅     |
| Release     | 11     | 0      | 2       | 0     | 13    | ✅     |
| Monitor     | 20     | 0      | 16      | 0     | 36    | ✅     |
| Manage      | 6      | 0      | 4       | 0     | 10    | ✅     |
| Configure   | 0      | 0      | 6       | 0     | 6     | ➖     |
| Growth      | 0      | 0      | 4       | 0     | 4     | ➖     |
| Secure      | 4      | 0      | 4       | 2     | 8     | ✅     |
| ModelOps    | 0      | 0      | 2       | 0     | 2     | ➖     |
| Analytics   | 4      | 0      | 0       | 0     | 4     | ✅     |
| Ai-powered  | 0      | 0      | 2       | 0     | 2     | ➖     |
+-------------+--------+--------+---------+-------+-------+--------+
| Total       | 1286   | 0      | 281     | 4     | 1567  | ✅     |
+-------------+--------+--------+---------+-------+-------+--------+
```
- Resolved by Max Fan
removed pipeline:mr-approved label
requested review from @shinya.maeda and removed review request for @hustewart
added pipeline:mr-approved label
requested review from @allison.browne and removed review request for @shinya.maeda
added workflow::in review label and removed workflow::in dev label
added pipeline::tier-2 label
requested review from @ayufan
requested review from @mfanGitLab
removed review request for @ayufan and @allison.browne
- Resolved by Max Fan
@partiaga LGTM!

One question though: is there a reason/guideline why we didn't just change `AssignResourceFromResourceGroupWorker` to `until_executing`, instead of creating a duplicate Sidekiq worker with this one param change? I've seen this done before (maybe erroneously) and always thought it was OK, as the deployment doesn't kill in-progress workers.

Approving and setting MWPS because it's behind a FF, it works, and it seems like there's a bit of a rush here.
added pipeline::tier-3 label and removed pipeline::tier-2 label
enabled an automatic merge when the pipeline for 2c7cf0b8 succeeds
mentioned in commit 2f3a7635
added workflow::staging-canary label and removed workflow::in review label
added workflow::canary label and removed workflow::staging-canary label
added workflow::staging label and removed workflow::canary label
added workflow::production label and removed workflow::staging label
mentioned in merge request !155407 (closed)
mentioned in issue #436988 (closed)
added released::candidate label
added released::published label and removed released::candidate label
mentioned in merge request !157156 (merged)