Use until_executing when deduplicating the "assign resource worker"
What does this MR do and why?
Jobs with a resource group sometimes get stuck "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)
This problem is due to the fact that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, is deduplicated with an until_executed strategy. This means that if a job for a resource group is still running, a newly queued job for the same resource group gets dropped.
We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, upon verifying in production, we observed that !147313 (merged) did not really fix the problem because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).
The change
This current MR tackles the actual cause of the problem, which is: jobs being dropped if another job for the same resource group is RUNNING. Here, we change the deduplication strategy to until_executing, which means that jobs will be dropped if another job for the same resource group is QUEUED; if the job is already running, new jobs can be queued. I believe that this change, in combination with the first fix, will prevent the possibility of jobs getting stuck at "waiting for resource".
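A toy model can make the difference between the two strategies concrete. The sketch below is not GitLab's deduplication code: ToyQueue and enqueued_while_running? are hypothetical names, and the only assumption carried over from the description above is when the deduplication lock is released (when the job starts for until_executing, when it finishes for until_executed).

```ruby
# Toy model of the two deduplication strategies (illustration only,
# not GitLab's or Sidekiq's implementation).
class ToyQueue
  attr_reader :queued, :dropped

  def initialize(strategy)
    @strategy = strategy # :until_executing or :until_executed
    @locks = {}          # dedup key => true while the lock is held
    @queued = []
    @dropped = []
  end

  # Returns true if the job was accepted, false if it was dropped as a duplicate.
  def enqueue(key)
    if @locks[key]
      @dropped << key
      return false
    end
    @locks[key] = true
    @queued << key
    true
  end

  # Simulates a job leaving the queue and starting to run.
  def start_next
    key = @queued.shift
    @locks.delete(key) if @strategy == :until_executing
    key
  end

  # Simulates a running job finishing.
  def finish(key)
    @locks.delete(key) if @strategy == :until_executed
  end
end

# Enqueue a second job for the same key while the first one is running.
def enqueued_while_running?(strategy)
  queue = ToyQueue.new(strategy)
  queue.enqueue("resource_group_a")
  running = queue.start_next
  accepted = queue.enqueue("resource_group_a")
  queue.finish(running)
  accepted
end
```

Under this model, enqueued_while_running?(:until_executed) returns false (the second job is dropped), while enqueued_while_running?(:until_executing) returns true (the second job is queued).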
Caveats and considerations
This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.
- This is instead introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes
- It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is the exact copy of AssignResourceFromResourceGroupWorker, except it has a deduplication strategy of until_executing. (FF rollout issue: #460793 (closed))
- Switching between the new worker and the old worker when enabling/disabling feature flags should be okay. See: #460793 (closed)
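The worker switch described above could look roughly like the following. This is a self-contained sketch, not the code in this MR: ToyFeature, ToyWorker, OldAssignWorker, NewAssignWorker, and assign_resource_worker are hypothetical stand-ins, and only the feature flag name is taken from the description.

```ruby
# Stand-in for a feature flag store (illustration only; this is not
# GitLab's Feature API).
module ToyFeature
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.disable(name)
    @enabled.delete(name)
  end

  def self.enabled?(name)
    @enabled.key?(name)
  end
end

# Stand-in worker base class that records calls instead of enqueuing jobs.
class ToyWorker
  def self.calls
    @calls ||= []
  end

  def self.perform_async(resource_group_id)
    calls << resource_group_id
  end
end

# Stands in for AssignResourceFromResourceGroupWorker (until_executed).
class OldAssignWorker < ToyWorker; end

# Stands in for NewAssignResourceFromResourceGroupWorker (until_executing).
class NewAssignWorker < ToyWorker; end

FLAG = :assign_resource_worker_deduplicate_until_executing

# Pick the worker class based on the current state of the flag.
def assign_resource_worker
  ToyFeature.enabled?(FLAG) ? NewAssignWorker : OldAssignWorker
end
```

Because the choice is made at enqueue time, toggling the flag only affects jobs enqueued afterwards; jobs already queued for the other worker still run with that worker.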
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
N/A. See validation steps.
How to set up and validate locally
The problem is impossible to replicate locally, but we can instead make sure that this change does not introduce any errors. We can also make sure that the correct workers are called depending on the status of the feature flag.
Setup
- Create a project
- Add a .gitlab-ci.yml and a child .deploy.yml pipeline configuration.

.gitlab-ci.yml

    build:
      stage: build
      script: echo "building stuff"

    deploy_a:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_a_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    # you can delete all the other deploy triggers below for faster testing
    deploy_b:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_b_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    deploy_c:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_c_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    deploy_d:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_d_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    deploy_e:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_e_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

.deploy.yml

    deploy1:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY"
      environment:
        name: production
        action: start

    deploy2:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY2"
      environment:
        name: production2
        action: start

    deploy3:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY3"
      environment:
        name: production3
        action: start

    deploy4:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY4"
      environment:
        name: production4
        action: start

    deploy5:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY5"
      environment:
        name: production5
        action: start
- (Optional) Enable log level :info for your development environment by editing the config/environments/development.rb file and adding a config.log_level = :info line.
- In 2 different terminal windows, run the following in your GDK directory:
  - to check logs for AssignResourceFromResourceGroupWorker:
    gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'
  - to check logs for NewAssignResourceFromResourceGroupWorker:
    gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'
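If watching two grep windows is inconvenient, the same check can be approximated by counting log lines per worker class. The sketch below assumes, as the grep patterns above imply, that each background-job log line is a JSON object with a "class" key; count_by_worker and SAMPLE_LINES are hypothetical names, and the sample lines are made up for illustration, not real log output.

```ruby
require "json"

# Count structured log lines per worker class.
# Lines that are not valid JSON, or have no "class" key, are skipped.
def count_by_worker(lines)
  counts = Hash.new(0)
  lines.each do |line|
    begin
      entry = JSON.parse(line)
    rescue JSON::ParserError
      next
    end
    next unless entry.is_a?(Hash) && entry["class"]

    counts[entry["class"]] += 1
  end
  counts
end

# Made-up sample lines (assumption: real lines carry many more fields).
SAMPLE_LINES = [
  '{"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"}',
  '{"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"}',
  'a line that is not JSON',
  '{"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"}'
].freeze
```

With the feature flag enabled, the count for the old worker class should stop growing while the count for the new worker class increases, and the reverse with the flag disabled.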
Testing
With the assign_resource_worker_deduplicate_until_executing feature flag disabled, run the pipeline a couple of times and verify that AssignResourceFromResourceGroupWorker is being called.
- In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that AssignResourceFromResourceGroupWorker jobs are being run.
- If you have enabled log level :info, verify that:
  - logs for AssignResourceFromResourceGroupWorker are showing
  - logs for NewAssignResourceFromResourceGroupWorker are NOT showing
With the assign_resource_worker_deduplicate_until_executing feature flag enabled, run the pipeline a couple of times and verify that NewAssignResourceFromResourceGroupWorker is being called.
- In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that NewAssignResourceFromResourceGroupWorker jobs are being run.
- If you have enabled log level :info, verify that:
  - logs for AssignResourceFromResourceGroupWorker are NOT showing
  - logs for NewAssignResourceFromResourceGroupWorker are showing
Related to #436988 (closed)
Activity
changed milestone to %17.1
assigned to @partiaga
- A deleted user added backend feature flag labels
- Resolved by Pam Artiaga
1 Warning
757efdfc: Commits that change 30 or more lines across at least 3 files should describe these changes in the commit body. For more information, take a look at our Commit message guidelines.
1 Message
CHANGELOG missing: If this merge request needs a changelog entry, add the Changelog trailer to the commit message you want to add to the changelog.
If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.
Reviewer roulette
| Category | Reviewer | Maintainer |
| --- | --- | --- |
| backend | @bhrai (UTC+2, same timezone as author) | @wandering_person (UTC+7, 5 hours ahead of author) |
| ~"Verify" | Reviewer review is optional for ~"Verify" | @drew (UTC+0, 2 hours behind author) |

Please check reviewer's status!
Please refer to documentation page for guidance on how you can benefit from the Reviewer Roulette, or use the GitLab Review Workload Dashboard to find other available reviewers.
Sidekiq queue changes
This merge request contains changes to Sidekiq queues. Please follow the documentation on changing a queue's urgency.
These queues were added:
pipeline_processing:ci_resource_groups_assign_resource_from_resource_group_worker_v2
If needed, you can retry the danger-review job that generated this comment.
Generated by Danger
Edited by Ghost User
mentioned in issue #408112