Use until_executing when deduplicating the "assign resource worker" (!152303) · Merge requests · GitLab.org / GitLab

Pam Artiaga requested to merge 436988-use-until-executing-deduplication-strategy-for-assign-resource-worker into master May 08, 2024

What does this MR do and why?

Jobs with a resource group sometimes get stuck "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)

The cause is that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, is deduplicated with an until_executed strategy. This means that if a worker job for a resource group is still running, a newly queued worker job for the same resource group gets dropped.

We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, upon verifying in production, we observed that !147313 (merged) did not really fix the problem because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).

The change

This MR tackles the actual cause of the problem, which is: jobs being dropped if another job for the same resource group is RUNNING. Here, we change the deduplication strategy to until_executing, which means that jobs will be dropped only if another job for the same resource group is QUEUED; if that job is already running, new jobs can be queued. I believe that this change, in combination with the first fix, will prevent jobs from getting stuck at "waiting for resource".
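The difference between the two strategies can be illustrated with a toy model. The sketch below is not GitLab's deduplication code: `ToyQueue`, `simulate`, and the exact points at which the lock is released are assumptions made for illustration, based only on the behaviour described above.

```ruby
require "set"

# Toy model of job deduplication (an illustration, not GitLab's implementation).
# Enqueuing a job takes a lock on its dedup key (here, the resource group).
#   :until_executed  releases the lock after the job has finished.
#   :until_executing releases the lock just before the job starts running.
class ToyQueue
  def initialize(strategy)
    @strategy = strategy
    @locks = Set.new
    @queue = []
  end

  # Returns true if the job was queued, false if it was dropped as a duplicate.
  def enqueue(key)
    return false if @locks.include?(key)

    @locks << key
    @queue << key
    true
  end

  # Runs the next queued job. The block stands in for whatever happens
  # while that job is executing.
  def run_next
    key = @queue.shift
    return unless key

    @locks.delete(key) if @strategy == :until_executing
    yield key if block_given?
    @locks.delete(key) if @strategy == :until_executed
    key
  end
end

# Enqueue one job, then request a second job for the same resource group
# while the first one is still running. Returns whether the second request
# was accepted.
def simulate(strategy)
  queue = ToyQueue.new(strategy)
  queue.enqueue("resource_group_a")

  accepted = nil
  queue.run_next { |key| accepted = queue.enqueue(key) }
  accepted
end

puts simulate(:until_executed)  # false: the second job is dropped
puts simulate(:until_executing) # true: the second job is queued
```

In this model, the request that arrives while a job is running is lost under until_executed and kept under until_executing, which is the case this MR targets.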

Caveats and considerations

This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.

- This is instead introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes.
- It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is an exact copy of AssignResourceFromResourceGroupWorker, except that it has a deduplication strategy of until_executing. (FF rollout issue: #460793 (closed))
- Switching between the new worker and the old worker when enabling/disabling the feature flag should be okay. See: #460793 (closed)
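As a rough sketch of the switching described above: the two classes below are stand-ins for the real Sidekiq workers, and `assign_resource_worker`, the `DEDUPLICATION_STRATEGY` constants, and the boolean `flag_enabled` argument are hypothetical names used only for illustration. The actual MR may wire the feature flag check differently.

```ruby
# Stand-ins for the two workers; the real classes live in the GitLab codebase
# and are Sidekiq workers. Only the deduplication strategy differs.
module Ci
  module ResourceGroups
    class AssignResourceFromResourceGroupWorker
      DEDUPLICATION_STRATEGY = :until_executed # hypothetical constant

      # Stub: returns what would have been enqueued instead of enqueuing it.
      def self.perform_async(resource_group_id)
        [name, resource_group_id]
      end
    end

    class NewAssignResourceFromResourceGroupWorker
      DEDUPLICATION_STRATEGY = :until_executing # hypothetical constant

      def self.perform_async(resource_group_id)
        [name, resource_group_id]
      end
    end
  end
end

# Hypothetical helper: picks the worker class based on the feature flag state.
# In GitLab the flag would be assign_resource_worker_deduplicate_until_executing;
# here it is passed in as a plain boolean to keep the sketch self-contained.
def assign_resource_worker(flag_enabled)
  if flag_enabled
    Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker
  else
    Ci::ResourceGroups::AssignResourceFromResourceGroupWorker
  end
end

p assign_resource_worker(false).perform_async(42)
p assign_resource_worker(true).perform_async(42)
```

Because the choice is made at enqueue time, toggling the flag only changes which worker class receives new jobs; jobs already queued for the other class still run.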

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A. See validation steps.

How to set up and validate locally

The problem is impossible to replicate locally, but we can instead make sure that this change does not introduce any errors. We can also make sure that the correct workers are called depending on the status of the feature flag.

Setup

Create a project

Add a .gitlab-ci.yml and child .deploy.yml pipeline configuration

.gitlab-ci.yml

```
build:
  stage: build
  script: echo "building stuff"

deploy_a:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_a_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

# you can delete all the other deploy triggers below for faster testing
deploy_b:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_b_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_c:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_c_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_d:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_d_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend

deploy_e:
  stage: deploy
  variables:
    RESOURCE_GROUP_KEY: "resource_group_e_child"
  trigger:
    include: ".deploy.yml"
    strategy: depend
```

.deploy.yml

```
deploy1:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY"
  environment:
    name: production
    action: start

deploy2:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY2"
  environment:
    name: production2
    action: start

deploy3:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY3"
  environment:
    name: production3
    action: start

deploy4:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY4"
  environment:
    name: production4
    action: start

deploy5:
  stage: deploy
  resource_group: $RESOURCE_GROUP_KEY
  script:
    - echo "DEPLOY5"
  environment:
    name: production5
    action: start
```

(Optional) Enable log level :info for your development environment by editing the config/environments/development.rb file and adding a config.log_level = :info line. Then, in 2 different terminal windows, run the following in your GDK directory:

- To check logs for AssignResourceFromResourceGroupWorker:
```
gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'
```
- To check logs for NewAssignResourceFromResourceGroupWorker:
```
gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'
```

Testing

With the assign_resource_worker_deduplicate_until_executing feature flag disabled, run the pipeline a couple of times and verify that AssignResourceFromResourceGroupWorker is being called.

In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that AssignResourceFromResourceGroupWorker jobs are being run.

If you have enabled log level :info, verify that:
- logs for AssignResourceFromResourceGroupWorker are showing
- logs for NewAssignResourceFromResourceGroupWorker are NOT showing

With the assign_resource_worker_deduplicate_until_executing feature flag enabled, run the pipeline a couple of times and verify that NewAssignResourceFromResourceGroupWorker is being called.

In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that NewAssignResourceFromResourceGroupWorker jobs are being run.

If you have enabled log level :info, verify that:
- logs for AssignResourceFromResourceGroupWorker are NOT showing
- logs for NewAssignResourceFromResourceGroupWorker are showing
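
As an alternative to watching two grep windows, the log check can be scripted. The sketch below is a hypothetical helper (`count_by_worker` is not part of GDK or GitLab) and assumes each relevant log line contains a JSON object with a "class" key, as the grep patterns above imply; any text before the first `{` is ignored.

```ruby
require "json"

# Counts log lines per worker class. Lines without a parseable JSON object
# or without a "class" key are skipped.
def count_by_worker(lines)
  counts = Hash.new(0)

  lines.each do |line|
    start = line.index("{") # tolerate a prefix before the JSON payload
    next unless start

    begin
      entry = JSON.parse(line[start..])
    rescue JSON::ParserError
      next
    end

    worker = entry["class"] if entry.is_a?(Hash)
    counts[worker] += 1 if worker
  end

  counts
end

# Made-up sample lines for illustration only.
sample = [
  '{"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"}',
  'some prefix : {"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"}',
  "not a json line"
]

count_by_worker(sample).each { |worker, count| puts "#{worker}: #{count}" }
```

To use it on real output, the same function could be fed captured log lines, for example by saving the `gdk tail rails-background-jobs` output to a file and passing `File.foreach(path)`. With the feature flag enabled, only the New worker should show a non-zero count; with it disabled, only the original worker should.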

Related to #436988 (closed)

Edited May 09, 2024 by Pam Artiaga
