Use until_executing when deduplicating the "assign resource worker"


What does this MR do and why?

Jobs with a resource group sometimes get stuck "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)

This happens because AssignResourceFromResourceGroupWorker, the worker that assigns a job to a resource group, is deduplicated with the until_executed strategy. This means that if a worker job for a resource group is still running, a newly enqueued worker job for the same resource group gets dropped.
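
For context, the deduplication strategy is declared on the worker class itself. The following is a minimal sketch of the relevant part of the current worker, not the actual source: the perform body and surrounding boilerplate are abbreviated and assumed, and only the deduplicate declaration matters here.

    # Abbreviated sketch of the existing worker (boilerplate assumed).
    module Ci
      module ResourceGroups
        class AssignResourceFromResourceGroupWorker
          include ApplicationWorker

          # until_executed: a newly enqueued duplicate is dropped while another
          # instance is queued OR still running -- the source of the stuck jobs.
          deduplicate :until_executed
          idempotent!

          def perform(resource_group_id)
            # Look up the resource group and assign a free resource
            # to a job that is waiting for one.
          end
        end
      end
    end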

We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, when we verified the fix in production, we observed that !147313 (merged) did not actually fix the problem, because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).

The change

This MR tackles the actual cause of the problem: jobs are dropped when another job for the same resource group is RUNNING. We change the deduplication strategy to until_executing, which means that jobs are dropped only while another job for the same resource group is QUEUED; once that job is running, new jobs can be enqueued. I believe that this change, in combination with the first fix, will prevent jobs from getting stuck at "waiting for resource".
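
In code terms, the only functional difference between the old and the new behavior is the argument passed to the deduplicate declaration; a sketch for contrast:

    # until_executed: duplicates are dropped while an instance
    # is queued OR running.
    deduplicate :until_executed

    # until_executing: duplicates are dropped only while an instance is
    # queued; once it starts running, a new one can be enqueued again.
    deduplicate :until_executing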

Caveats and considerations

This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.

  • The change is therefore introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes.
  • It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is an exact copy of AssignResourceFromResourceGroupWorker, except with a deduplication strategy of until_executing (see the dispatch sketch after this list). (FF rollout issue: #460793 (closed))
  • Switching between the new worker and the old worker when enabling/disabling the feature flag should be okay. See: #460793 (closed)
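
For illustration, the call site presumably dispatches on the flag roughly as follows. This is a minimal sketch: the worker classes and the flag name come from this MR, and Feature.enabled? is the standard GitLab feature-flag check, but the helper name and the actor passed to the flag check are assumptions.

    # Hypothetical helper that enqueues the assign-resource worker.
    def enqueue_assign_resource_worker(resource_group)
      if Feature.enabled?(:assign_resource_worker_deduplicate_until_executing, resource_group.project)
        # New worker, deduplicated with until_executing.
        Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker.perform_async(resource_group.id)
      else
        # Old worker, deduplicated with until_executed.
        Ci::ResourceGroups::AssignResourceFromResourceGroupWorker.perform_async(resource_group.id)
      end
    end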

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A. See validation steps.

How to set up and validate locally

The problem is impossible to replicate locally, but we can make sure that this change does not introduce any errors, and that the correct worker is called depending on the status of the feature flag.

Setup

  1. Create a project

  2. Add a .gitlab-ci.yml and a child .deploy.yml pipeline configuration:

    .gitlab-ci.yml:
    build:
      stage: build
      script: echo "building stuff"
    
    deploy_a:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_a_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    # you can delete all the other deploy triggers below for faster testing
    deploy_b:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_b_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_c:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_c_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_d:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_d_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_e:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_e_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    .deploy.yml:
    deploy1:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY"
      environment:
        name: production
        action: start
    
    deploy2:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY2"
      environment:
        name: production2
        action: start
    
    deploy3:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY3"
      environment:
        name: production3
        action: start
      
    deploy4:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY4"
      environment:
        name: production4
        action: start
    
    deploy5:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY5"
      environment:
        name: production5
        action: start
  3. (Optional) Enable log level :info for your development environment by editing config/environments/development.rb and adding:

       config.log_level = :info

  4. In two different terminal windows, run the following in your GDK directory:

     To check logs for AssignResourceFromResourceGroupWorker:

       gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'

     To check logs for NewAssignResourceFromResourceGroupWorker:

       gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'

Testing

With the assign_resource_worker_deduplicate_until_executing feature flag disabled, run the pipeline a couple of times and verify that AssignResourceFromResourceGroupWorker is being called:

  • In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that AssignResourceFromResourceGroupWorker jobs are being run
    (screenshot: Screenshot_2024-05-09_at_14.30.55)

  • if you have enabled log level :info, verify that:

    • logs for AssignResourceFromResourceGroupWorker are showing
    • logs for NewAssignResourceFromResourceGroupWorker are NOT showing

With the assign_resource_worker_deduplicate_until_executing feature flag enabled, run the pipeline a couple of times and verify that NewAssignResourceFromResourceGroupWorker is being called:

  • In https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that NewAssignResourceFromResourceGroupWorker jobs are being run
    (screenshot: Screenshot_2024-05-09_at_14.47.52)

  • if you have enabled log level :info, verify that:

    • logs for AssignResourceFromResourceGroupWorker are NOT showing
    • logs for NewAssignResourceFromResourceGroupWorker are showing

Related to #436988 (closed)



Activity

  • 1 Warning
    :warning: 757efdfc: Commits that change 30 or more lines across at least 3 files should describe these changes in the commit body. For more information, take a look at our Commit message guidelines.
    1 Message
    :book: CHANGELOG missing:

    If this merge request needs a changelog entry, add the Changelog trailer to the commit message you want to add to the changelog.

    If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.

    Reviewer roulette

    Category   | Reviewer | Maintainer
    backend    | @bhrai (UTC+2, same timezone as author) | @wandering_person (UTC+7, 5 hours ahead of author)
    ~"Verify"  | @drew (UTC+0, 2 hours behind author); reviewer review is optional for ~"Verify" | (none listed)


    Please refer to documentation page for guidance on how you can benefit from the Reviewer Roulette, or use the GitLab Review Workload Dashboard to find other available reviewers.

    Sidekiq queue changes

    This merge request contains changes to Sidekiq queues. Please follow the documentation on changing a queue's urgency.

    These queues were added:

    • pipeline_processing:ci_resource_groups_assign_resource_from_resource_group_worker_v2

    If needed, you can retry the :repeat: danger-review job that generated this comment.

    Generated by :no_entry_sign: Danger

  • Pam Artiaga added 1 commit: a61d4551 - Fix worker perform calls, lint errors

  • Pam Artiaga resolved all threads

  • Pam Artiaga added 1 commit: 172a3c33 - Add MR url to feature flag definition

  • Pam Artiaga mentioned in issue #408112

  • Pam Artiaga added 1 commit: d076c0f1 - Update feature flag milestone

  • Pam Artiaga changed the description
  • Pam Artiaga changed the description
  • Pam Artiaga changed the description

  • Pam Artiaga marked this merge request as ready

  • Pam Artiaga changed title from Introduce a new worker for assigning processables to resource group to Use until_executing strategy when deduplicating the "assign resource worker"

    • Pam Artiaga (Author, Maintainer) commented (thread resolved by Max Fan):

      Hi @hustewart would you mind giving this one the initial review since you already have context for this?

      I believe that the most correct solution for this problem is actually to introduce a way in Sidekiq to allow jobs to be run one-at-a-time without dropping queued jobs at all. It's similar to the Sidekiq Unique Jobs mechanism, except we have to build it ourselves.

      Before doing that, I would like to try this solution, which is to "relax" the deduplication strategy from until_executed to until_executing.

      I'd like your thoughts/feedback on this one, particularly whether this makes sense conceptually. If all is good, we can pull @shinya.maeda here as the maintainer reviewer.

  • Pam Artiaga requested review from @hustewart

  • Hunter Stewart approved this merge request

    • Resolved by Max Fan

      :wave: @hustewart, thanks for approving this merge request.

      This is the first time the merge request has been approved. To ensure we don't only run predictive pipelines, and we don't break master, a new pipeline will be started shortly.

      Please wait for the pipeline to start before resolving this discussion and set auto-merge for the new pipeline. See merging a merge request for more details.

  • Pam Artiaga added 1 commit

  • Pam Artiaga reset approvals from @hustewart by pushing to the branch

  • E2E Test Result Summary

    allure-report-publisher generated test report!

    e2e-test-on-gdk: :white_check_mark: test report for 757efdfc

    +------------------------------------------------------------------+
    |                          suites summary                          |
    +-------------+--------+--------+---------+-------+-------+--------+
    |             | passed | failed | skipped | flaky | total | result |
    +-------------+--------+--------+---------+-------+-------+--------+
    | Data Stores | 62     | 0      | 0       | 44    | 62    | ✅     |
    | Monitor     | 16     | 0      | 0       | 14    | 16    | ✅     |
    | Create      | 222    | 0      | 18      | 188   | 240   | ✅     |
    | Verify      | 62     | 0      | 2       | 60    | 64    | ✅     |
    | Govern      | 132    | 0      | 0       | 86    | 132   | ✅     |
    | Manage      | 0      | 0      | 2       | 0     | 2     | ➖     |
    | Plan        | 108    | 0      | 4       | 94    | 112   | ✅     |
    | Package     | 38     | 0      | 24      | 38    | 62    | ✅     |
    | Release     | 10     | 0      | 0       | 10    | 10    | ✅     |
    | Analytics   | 4      | 0      | 0       | 2     | 4     | ✅     |
    +-------------+--------+--------+---------+-------+-------+--------+
    | Total       | 654    | 0      | 50      | 536   | 704   | ✅     |
    +-------------+--------+--------+---------+-------+-------+--------+

    e2e-package-and-test: :white_check_mark: test report for 757efdfc

    +------------------------------------------------------------------+
    |                          suites summary                          |
    +-------------+--------+--------+---------+-------+-------+--------+
    |             | passed | failed | skipped | flaky | total | result |
    +-------------+--------+--------+---------+-------+-------+--------+
    | Govern      | 178    | 0      | 8       | 2     | 186   | ✅     |
    | Create      | 569    | 0      | 69      | 0     | 638   | ✅     |
    | Data Stores | 79     | 0      | 18      | 0     | 97    | ✅     |
    | Package     | 123    | 0      | 62      | 0     | 185   | ✅     |
    | Plan        | 183    | 0      | 14      | 0     | 197   | ✅     |
    | Fulfillment | 4      | 0      | 50      | 0     | 54    | ✅     |
    | Verify      | 105    | 0      | 20      | 0     | 125   | ✅     |
    | Release     | 11     | 0      | 2       | 0     | 13    | ✅     |
    | Monitor     | 20     | 0      | 16      | 0     | 36    | ✅     |
    | Manage      | 6      | 0      | 4       | 0     | 10    | ✅     |
    | Configure   | 0      | 0      | 6       | 0     | 6     | ➖     |
    | Growth      | 0      | 0      | 4       | 0     | 4     | ➖     |
    | Secure      | 4      | 0      | 4       | 2     | 8     | ✅     |
    | ModelOps    | 0      | 0      | 2       | 0     | 2     | ➖     |
    | Analytics   | 4      | 0      | 0       | 0     | 4     | ✅     |
    | Ai-powered  | 0      | 0      | 2       | 0     | 2     | ➖     |
    +-------------+--------+--------+---------+-------+-------+--------+
    | Total       | 1286   | 0      | 281     | 4     | 1567  | ✅     |
    +-------------+--------+--------+---------+-------+-------+--------+
  • Pam Artiaga changed title from Use until_executing strategy when deduplicating the "assign resource worker" to Use until_executing when deduplicating the "assign resource worker"

  • Hunter Stewart approved this merge request

  • Hunter Stewart requested review from @shinya.maeda and removed review request for @hustewart

  • Shinya Maeda approved this merge request

  • Shinya Maeda requested review from @allison.browne and removed review request for @shinya.maeda

  • added workflow::in review label and removed workflow::in dev label

  • Pam Artiaga requested review from @ayufan

  • Allison Browne requested review from @mfanGitLab

  • Allison Browne removed review request for @ayufan and @allison.browne

  • Max Fan resolved all threads

  • Max Fan approved this merge request

    • Resolved by Max Fan

      @partiaga LGTM!

      One question though: is there a reason/guideline why we didn't just change AssignResourceFromResourceGroupWorker to until_executing, instead of creating a duplicate Sidekiq worker with this one param change? I've seen this done before (maybe erroneously) and always thought it was OK, as a deployment doesn't kill in-progress workers.

      Approving and setting MWPS because it's behind a FF, it works, and it seems like there's a bit of a rush here.

  • added pipeline::tier-3 label and removed pipeline::tier-2 label

  • Max Fan enabled an automatic merge when the pipeline for 2c7cf0b8 succeeds

  • merged

  • Max Fan mentioned in commit 2f3a7635

  • added workflow::staging label and removed workflow::canary label

  • Max Fan resolved all threads

  • Pam Artiaga mentioned in merge request !155407 (closed)

  • mentioned in issue #436988 (closed)

  • Pam Artiaga mentioned in merge request !157156 (merged)
