Use until_executing when deduplicating the "assign resource worker"
What does this MR do and why?
Jobs with a resource group sometimes get stuck at "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)
This problem is due to the fact that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, is deduplicated with an until_executed strategy. This means that if a job for a resource group is still running, a newly queued job for the same resource group gets dropped.
We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, when verifying in production, we observed that !147313 (merged) did not fully fix the problem, because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).
The change
This MR tackles the actual cause of the problem: jobs being dropped if another job for the same resource group is RUNNING. Here, we change the deduplication strategy to until_executing, which means that jobs will be dropped only if another job for the same resource group is QUEUED; if that job is already running, new jobs can be queued. I believe that this change, in combination with the first fix, will prevent jobs from getting stuck at "waiting for resource".
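To illustrate the difference between the two strategies, here is a minimal toy model in plain Ruby. It is not GitLab's deduplication implementation: the DedupQueue class, its methods, and the "resource_group_a" key are all hypothetical, and the only assumption is that until_executed holds the deduplication lock until the job finishes while until_executing releases it when the job starts.

```ruby
# Toy model of job deduplication (hypothetical; not GitLab's implementation).
class DedupQueue
  attr_reader :queued, :dropped

  def initialize(strategy)
    @strategy = strategy # :until_executed or :until_executing
    @locks = {}          # dedup key => true while the lock is held
    @queued = []
    @dropped = []
  end

  # Enqueue a job unless a duplicate currently holds the lock.
  def enqueue(key)
    if @locks[key]
      @dropped << key
      false
    else
      @locks[key] = true
      @queued << key
      true
    end
  end

  # Run the next job; the block stands in for the job body.
  def perform_next
    key = @queued.shift
    return unless key

    # until_executing: the lock is released as soon as execution starts.
    @locks.delete(key) if @strategy == :until_executing
    yield key if block_given?
  ensure
    # until_executed: the lock is released only once execution has finished.
    @locks.delete(key) if key && @strategy == :until_executed
  end
end

# Returns whether a job enqueued WHILE another job for the same key is
# running gets accepted (true) or dropped (false).
def enqueue_while_running(strategy)
  queue = DedupQueue.new(strategy)
  queue.enqueue("resource_group_a")
  accepted = nil
  queue.perform_next { |key| accepted = queue.enqueue(key) }
  accepted
end

puts enqueue_while_running(:until_executed)
puts enqueue_while_running(:until_executing)
```

In this model, a job that arrives while the first one is still only queued is dropped under both strategies; the strategies differ only for jobs that arrive during execution.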
Caveats and considerations
This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.
- This is instead introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes.
- It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is an exact copy of AssignResourceFromResourceGroupWorker, except it has a deduplication strategy of until_executing. (FF rollout issue: #460793 (closed))
- Switching between the new worker and the old worker when enabling/disabling feature flags should be okay. See: #460793 (closed)
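The worker switch described above can be sketched roughly as follows. This is a self-contained illustration, not the code in this MR: the FeatureFlags module is a hypothetical stand-in for the application's feature flag API, and OldAssignWorker and NewAssignWorker are hypothetical stand-ins for the two worker classes. Only the flag name and the two strategy names come from this description.

```ruby
# Hypothetical stand-in for the application's feature flag API.
module FeatureFlags
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.disable(name)
    @enabled.delete(name)
  end

  def self.enabled?(name)
    @enabled.key?(name)
  end
end

# Hypothetical stand-in for the existing worker.
class OldAssignWorker
  DEDUP_STRATEGY = :until_executed
end

# Hypothetical stand-in for the new worker: same behavior, different strategy.
class NewAssignWorker
  DEDUP_STRATEGY = :until_executing
end

# The strategy is fixed per worker class, so the flag selects the class.
def assign_resource_worker
  if FeatureFlags.enabled?(:assign_resource_worker_deduplicate_until_executing)
    NewAssignWorker
  else
    OldAssignWorker
  end
end

puts assign_resource_worker::DEDUP_STRATEGY
FeatureFlags.enable(:assign_resource_worker_deduplicate_until_executing)
puts assign_resource_worker::DEDUP_STRATEGY
FeatureFlags.disable(:assign_resource_worker_deduplicate_until_executing)
```

Because the choice is made each time a job is enqueued, toggling the flag only affects jobs enqueued afterwards; jobs already queued for the other worker still run with that worker's strategy.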
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
N/A. See validation steps.
How to set up and validate locally
The problem is impossible to replicate locally, but we can instead make sure that this change does not introduce any errors. We can also make sure that the correct workers are called depending on the status of the feature flag.
Setup
- Create a project
- Add a .gitlab-ci.yml and child .deploy.yml pipeline configuration.

  .gitlab-ci.yml

      build:
        stage: build
        script: echo "building stuff"

      deploy_a:
        stage: deploy
        variables:
          RESOURCE_GROUP_KEY: "resource_group_a_child"
        trigger:
          include: ".deploy.yml"
          strategy: depend

      # you can delete all the other deploy triggers below for faster testing
      deploy_b:
        stage: deploy
        variables:
          RESOURCE_GROUP_KEY: "resource_group_b_child"
        trigger:
          include: ".deploy.yml"
          strategy: depend

      deploy_c:
        stage: deploy
        variables:
          RESOURCE_GROUP_KEY: "resource_group_c_child"
        trigger:
          include: ".deploy.yml"
          strategy: depend

      deploy_d:
        stage: deploy
        variables:
          RESOURCE_GROUP_KEY: "resource_group_d_child"
        trigger:
          include: ".deploy.yml"
          strategy: depend

      deploy_e:
        stage: deploy
        variables:
          RESOURCE_GROUP_KEY: "resource_group_e_child"
        trigger:
          include: ".deploy.yml"
          strategy: depend

  .deploy.yml

      deploy1:
        stage: deploy
        resource_group: $RESOURCE_GROUP_KEY
        script:
          - echo "DEPLOY"
        environment:
          name: production
          action: start

      deploy2:
        stage: deploy
        resource_group: $RESOURCE_GROUP_KEY
        script:
          - echo "DEPLOY2"
        environment:
          name: production2
          action: start

      deploy3:
        stage: deploy
        resource_group: $RESOURCE_GROUP_KEY
        script:
          - echo "DEPLOY3"
        environment:
          name: production3
          action: start

      deploy4:
        stage: deploy
        resource_group: $RESOURCE_GROUP_KEY
        script:
          - echo "DEPLOY4"
        environment:
          name: production4
          action: start

      deploy5:
        stage: deploy
        resource_group: $RESOURCE_GROUP_KEY
        script:
          - echo "DEPLOY5"
        environment:
          name: production5
          action: start
- (Optional) Enable log level :info for your development environment by editing the config/environments/development.rb file and adding a config.log_level = :info line.
  - In 2 different terminal windows, run the following in your GDK directory:
    - to check logs for AssignResourceFromResourceGroupWorker:

          gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'

    - to check logs for NewAssignResourceFromResourceGroupWorker:

          gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'
Testing
With the assign_resource_worker_deduplicate_until_executing feature flag disabled, run the pipeline a couple of times and verify that AssignResourceFromResourceGroupWorker is being called.
- In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that AssignResourceFromResourceGroupWorker jobs are being run
- If you have enabled log level :info, verify that:
  - logs for AssignResourceFromResourceGroupWorker are showing
  - logs for NewAssignResourceFromResourceGroupWorker are NOT showing
With the assign_resource_worker_deduplicate_until_executing feature flag enabled, run the pipeline a couple of times and verify that NewAssignResourceFromResourceGroupWorker is being called.
- In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that NewAssignResourceFromResourceGroupWorker jobs are being run
- If you have enabled log level :info, verify that:
  - logs for AssignResourceFromResourceGroupWorker are NOT showing
  - logs for NewAssignResourceFromResourceGroupWorker are showing
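If watching two terminals is inconvenient, the log checks in the testing steps could also be scripted. The sketch below counts log entries per worker class. It assumes each relevant log line contains a JSON object with a "class" key (as the grep patterns in the setup imply), possibly preceded by a non-JSON prefix; the count_by_worker_class helper and the sample lines are hypothetical and only illustrate the idea.

```ruby
require "json"

# Count log entries per worker class in JSON-lines style log output.
# Lines without a parseable JSON object are skipped.
def count_by_worker_class(lines)
  counts = Hash.new(0)
  lines.each do |line|
    start = line.index("{") # tolerate a non-JSON prefix before the object
    next unless start

    begin
      entry = JSON.parse(line[start..])
    rescue JSON::ParserError
      next
    end
    counts[entry["class"]] += 1 if entry.is_a?(Hash) && entry["class"]
  end
  counts
end

# Hypothetical sample lines; real log entries carry many more fields.
sample = [
  '{"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"}',
  'prefix {"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"}',
  "not json at all"
]
p count_by_worker_class(sample)
```

With the feature flag disabled, only the old worker class should accumulate new entries after a pipeline run; with it enabled, only the new one should.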
Related to #436988 (closed)