CI jobs with a resource group get stuck "waiting for resource" due to a race condition
Summary
The `AssignResourceFromResourceGroupWorker` drops duplicate jobs with the same `resource_group_id`, causing jobs to get stuck in "waiting for resource".
Further Context
We previously encountered a problem where a job with a resource group gets stuck due to a race condition. The `AssignResourceFromResourceGroupWorker`, which assigns a resource to an upcoming job, can only run one at a time per resource group because it uses the `deduplicate :until_executed` strategy. This was resolved by extending the strategy with an `if_deduplicated: :reschedule_once` option and having `AssignResourceFromResourceGroupWorker` use that option. (More details here: Pipeline job depends on Resource Group could be... (#342123 - closed))
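For reference, here is a simplified sketch of where that strategy is declared on the worker (abridged; the real worker declares additional worker attributes, but the `deduplicate` line is the relevant part):

```ruby
module Ci
  module ResourceGroups
    class AssignResourceFromResourceGroupWorker
      include ApplicationWorker

      # Only one job per resource_group_id runs at a time; a duplicate that
      # arrives while one is in flight is dropped, and one follow-up run is
      # scheduled after the in-flight job finishes.
      deduplicate :until_executed, if_deduplicated: :reschedule_once
      idempotent!

      def perform(resource_group_id)
        ::Ci::ResourceGroup.find_by_id(resource_group_id).try do |resource_group|
          Ci::ResourceGroups::AssignResourceFromResourceGroupService
            .new(resource_group.project, nil)
            .execute(resource_group)
        end
      end
    end
  end
end
```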
Now, a customer is once again encountering this problem. They are a special case because they have dozens of projects, each with a deployment job that uses a resource group, and they are on a GitLab Dedicated instance. This combination of factors means that, even if the `AssignResourceFromResourceGroupWorker` for a resource group is rescheduled when deduplicated, it can still run into a race condition if another `AssignResourceFromResourceGroupWorker` is already running for the same resource group.
Example scenarios of this problem happening:
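The following timeline is illustrative (reconstructed from the race described above, not taken from customer logs):

1. Worker A starts for resource group X while a deployment job still holds the only resource, so it reads `free_resources == 0`.
2. The deployment job finishes and releases the resource; Worker B is enqueued for resource group X.
3. Worker B is a duplicate of the in-flight Worker A, so it is dropped, and `if_deduplicated: :reschedule_once` schedules one follow-up run after Worker A finishes.
4. On a busy instance, that follow-up run can itself collide with yet another in-flight worker for resource group X; this time there is no second reschedule, so the duplicate is simply dropped.
5. The resource is now free, but no worker remains scheduled to assign it, and the upcoming job sits in "waiting for resource" indefinitely.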
Related issues
- Customer Reported Issues:
- Investigation issue: Investigate trigger job stuck at "Waiting for r... (#435437 - closed) (see comment: #435437 (comment 1711919114))
- Old issue: Pipeline job depends on Resource Group could be... (#342123 - closed)
Workaround
Cancel the pending job, then re-run it. (A scripted version of this workaround is sketched below.)
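For an instance where this happens frequently, the manual workaround can be scripted against the documented jobs API (`POST /projects/:id/jobs/:job_id/cancel` and `POST /projects/:id/jobs/:job_id/retry`). A minimal sketch; the instance URL, token handling, and IDs below are placeholders:

```ruby
require 'net/http'
require 'json'

GITLAB_API = 'https://gitlab.example.com/api/v4' # placeholder instance URL
TOKEN = ENV.fetch('GITLAB_TOKEN')                # token with `api` scope

def post(path)
  uri = URI("#{GITLAB_API}#{path}")
  request = Net::HTTP::Post.new(uri, 'PRIVATE-TOKEN' => TOKEN)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

project_id = 123 # hypothetical: project containing the stuck job
job_id     = 456 # hypothetical: the job stuck in "waiting for resource"

post("/projects/#{project_id}/jobs/#{job_id}/cancel")
post("/projects/#{project_id}/jobs/#{job_id}/retry")
```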
Proposed Fix
See #436988 (comment 1802863303)
Have the `AssignResourceFromResourceGroupWorker`/`Service` "re-spawn" itself

In the `AssignResourceFromResourceGroupService`, make it automatically kick off another `AssignResourceFromResourceGroupWorker` for the same resource group if it was not able to assign a resource to an upcoming processable:
```diff
diff --git a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
index d7078200c145..c932443131bd 100644
--- a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
+++ b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
@@ -9,9 +9,16 @@ def execute(resource_group)
       free_resources = resource_group.resources.free.count
 
-      resource_group.upcoming_processables.take(free_resources).each do |upcoming|
-        Gitlab::OptimisticLocking.retry_lock(upcoming, name: 'enqueue_waiting_for_resource') do |processable|
-          processable.enqueue_waiting_for_resource
+      if free_resources == 0
+        # if the resources are still 'tied up' in other processables
+        # just call the worker again to restart the workflow of
+        # checking for stale jobs, free resources, and upcoming processables
+        Ci::ResourceGroups::AssignResourceFromResourceGroupWorker.perform_async(resource_group.id)
+      else
+        resource_group.upcoming_processables.take(free_resources).each do |upcoming|
+          Gitlab::OptimisticLocking.retry_lock(upcoming, name: 'enqueue_waiting_for_resource') do |processable|
+            processable.enqueue_waiting_for_resource
+          end
         end
       end
     end
 
```
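One caveat with this sketch: while the resource stays tied up, the worker re-enqueues itself immediately on every run, which amounts to a tight requeue loop. A possible refinement (a suggestion beyond the linked comment, so treat it as an assumption) is to re-spawn with a delay via Sidekiq's standard `perform_in`; the constant name and one-minute value below are hypothetical:

```ruby
# Hypothetical variant of the branch above: wait before re-spawning so a
# long-busy resource group does not generate a stream of no-op workers.
RESPAWN_WAIT_TIME = 1.minute # placeholder delay

if free_resources == 0
  Ci::ResourceGroups::AssignResourceFromResourceGroupWorker
    .perform_in(RESPAWN_WAIT_TIME, resource_group.id)
else
  # ...assign resources to upcoming processables as in the diff above
end
```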
Remove section in Troubleshooting Doc
Since this is a long-running problem, we added a section about it to the troubleshooting doc along with a workaround: Update Resource Groups troubleshooting doc (!149229 - merged)
Once this problem has been fully fixed, we need to remove that section.