CI jobs with a resource group get stuck "waiting for resource" due to a race condition
Summary
The `AssignResourceFromResourceGroupWorker` drops duplicate jobs with the same `resource_group_id`, causing jobs to get stuck in "waiting for resource".
Further Context
We previously encountered a problem where a job with a resource group gets stuck due to a race condition. The `AssignResourceFromResourceGroupWorker`, which assigns a resource to an upcoming job, can only run one at a time per resource group because it uses the `deduplicate :until_executed` strategy. This was resolved by extending the strategy with an `if_deduplicated: :reschedule_once` option and having `AssignResourceFromResourceGroupWorker` use that option. (More details here: Pipeline job depends on Resource Group could be... (#342123 - closed))
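For reference, here is a simplified sketch of where that strategy is declared on the worker (abridged; the real worker declares additional worker attributes, but the `deduplicate` line is the relevant part):

```ruby
module Ci
  module ResourceGroups
    class AssignResourceFromResourceGroupWorker
      include ApplicationWorker

      # Only one job per resource_group_id runs at a time; a duplicate that
      # arrives while one is in flight is dropped, and one follow-up run is
      # scheduled after the in-flight job finishes.
      deduplicate :until_executed, if_deduplicated: :reschedule_once
      idempotent!

      def perform(resource_group_id)
        ::Ci::ResourceGroup.find_by_id(resource_group_id).try do |resource_group|
          Ci::ResourceGroups::AssignResourceFromResourceGroupService
            .new(resource_group.project, nil)
            .execute(resource_group)
        end
      end
    end
  end
end
```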
Now, a customer is once again encountering this problem. They are a special case because they have dozens of projects, each with a deployment job that uses a resource group, and they are on a GitLab Dedicated instance. This combination of factors means that, even if the `AssignResourceFromResourceGroupWorker` for a resource group is rescheduled when deduplicated, it can still run into a race condition if another `AssignResourceFromResourceGroupWorker` is already running for the same resource group.
Example scenarios of this problem happening:
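The following timeline is illustrative (reconstructed from the race described above, not taken from customer logs):

1. Worker A starts for resource group X while a deployment job still holds the only resource, so it reads `free_resources == 0`.
2. The deployment job finishes and releases the resource; Worker B is enqueued for resource group X.
3. Worker B is a duplicate of the in-flight Worker A, so it is dropped, and `if_deduplicated: :reschedule_once` schedules one follow-up run after Worker A finishes.
4. On a busy instance, that follow-up run can itself collide with yet another in-flight worker for resource group X; this time there is no second reschedule, so the duplicate is simply dropped.
5. The resource is now free, but no worker remains scheduled to assign it, and the upcoming job sits in "waiting for resource" indefinitely.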
Related issues
- Customer Reported Issues:
- Investigation issue: Investigate trigger job stuck at "Waiting for r... (#435437 - closed) (see comment: #435437 (comment 1711919114))
- Old issue: Pipeline job depends on Resource Group could be... (#342123 - closed)
Workaround
Cancel the pending job, then re-run it. (A scripted version of this workaround is sketched below.)
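For an instance where this happens frequently, the manual workaround can be scripted against the documented jobs API (`POST /projects/:id/jobs/:job_id/cancel` and `POST /projects/:id/jobs/:job_id/retry`). A minimal sketch; the instance URL, token handling, and IDs below are placeholders:

```ruby
require 'net/http'
require 'json'

GITLAB_API = 'https://gitlab.example.com/api/v4' # placeholder instance URL
TOKEN = ENV.fetch('GITLAB_TOKEN')                # token with `api` scope

def post(path)
  uri = URI("#{GITLAB_API}#{path}")
  request = Net::HTTP::Post.new(uri, 'PRIVATE-TOKEN' => TOKEN)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

project_id = 123 # hypothetical: project containing the stuck job
job_id     = 456 # hypothetical: the job stuck in "waiting for resource"

post("/projects/#{project_id}/jobs/#{job_id}/cancel")
post("/projects/#{project_id}/jobs/#{job_id}/retry")
```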
Proposed Fix
See #436988 (comment 1802863303)
Have the `AssignResourceFromResourceGroupWorker`/`Service` "re-spawn" itself

In the `AssignResourceFromResourceGroupService`, make it automatically kick off another `AssignResourceFromResourceGroupWorker` for the same resource group if it was not able to assign a resource to an upcoming processable:
```diff
diff --git a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
index d7078200c145..c932443131bd 100644
--- a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
+++ b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
@@ -9,9 +9,16 @@ def execute(resource_group)
       free_resources = resource_group.resources.free.count
 
-      resource_group.upcoming_processables.take(free_resources).each do |upcoming|
-        Gitlab::OptimisticLocking.retry_lock(upcoming, name: 'enqueue_waiting_for_resource') do |processable|
-          processable.enqueue_waiting_for_resource
+      if free_resources == 0
+        # if the resources are still 'tied up' in other processables
+        # just call the worker again to restart the workflow of
+        # checking for stale jobs, free resources, and upcoming processables
+        Ci::ResourceGroups::AssignResourceFromResourceGroupWorker.perform_async(resource_group.id)
+      else
+        resource_group.upcoming_processables.take(free_resources).each do |upcoming|
+          Gitlab::OptimisticLocking.retry_lock(upcoming, name: 'enqueue_waiting_for_resource') do |processable|
+            processable.enqueue_waiting_for_resource
+          end
         end
       end
     end
 
```
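One caveat with this sketch: while the resource stays tied up, the worker re-enqueues itself immediately on every run, which amounts to a tight requeue loop. A possible refinement (a suggestion beyond the linked comment, so treat it as an assumption) is to re-spawn with a delay via Sidekiq's standard `perform_in`; the constant name and one-minute value below are hypothetical:

```ruby
# Hypothetical variant of the branch above: wait before re-spawning so a
# long-busy resource group does not generate a stream of no-op workers.
RESPAWN_WAIT_TIME = 1.minute # placeholder delay

if free_resources == 0
  Ci::ResourceGroups::AssignResourceFromResourceGroupWorker
    .perform_in(RESPAWN_WAIT_TIME, resource_group.id)
else
  # ...assign resources to upcoming processables as in the diff above
end
```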
Remove section in Troubleshooting Doc
Since this is a long-running problem, we added a section about it to the troubleshooting doc along with a workaround: Update Resource Groups troubleshooting doc (!149229 - merged)
Once this problem has been fully fixed, we need to remove that section.