Re-spawn the AssignResource worker if busy (!147313) · Merge requests · GitLab.org / GitLab

Pam Artiaga requested to merge 436988-respawn-AssignResourceFromResourceGroupWorker into master Mar 19, 2024

What does this MR do and why?

We previously encountered a problem where a job with a resource group got stuck due to a race condition. The cause was that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, can only run one at a time per resource group because it uses the until_executed deduplication strategy. This was resolved by adding an if_deduplicated: reschedule_once option to the AssignResourceFromResourceGroupWorker. (More details here: Pipeline job depends on Resource Group could be... (#342123 - closed).)

It turns out that we still run into race conditions for pipelines that run in parallel, or for pipelines with multiple downstream/child pipelines that then run in parallel. Essentially, the AssignResourceFromResourceGroupWorker might stop assigning jobs to a specific resource group because it checks whether a resource is free before that resource has actually been freed.
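The ordering problem described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: ToyResourceGroup and assign_worker are hypothetical names invented for illustration, the group has a single resource, and nothing here is taken from GitLab's actual implementation.

```ruby
# Toy model only: ToyResourceGroup and assign_worker are hypothetical,
# not GitLab classes. The group owns exactly one resource.
class ToyResourceGroup
  attr_reader :running, :waiting

  def initialize(running:, waiting:)
    @running = running      # build currently holding the resource (or nil)
    @waiting = waiting.dup  # builds waiting for the resource
  end

  def free?
    @running.nil?
  end

  def release
    @running = nil
  end

  def assign_next
    @running = @waiting.shift
  end
end

# One pass of the assignment worker: assign a build only if the resource is free.
def assign_worker(group)
  return :skipped unless group.free?
  return :idle if group.waiting.empty?

  group.assign_next
  :assigned
end

group = ToyResourceGroup.new(running: "deploy1", waiting: ["deploy2"])

result = assign_worker(group) # the worker runs before deploy1 releases the resource
group.release                 # the resource is freed only afterwards

# Nothing triggers the worker again, so deploy2 keeps waiting on a free resource.
stuck = group.free? && !group.waiting.empty?
puts "worker result: #{result}, stuck: #{stuck}"
# prints: worker result: skipped, stuck: true
```

In this model the worker's only chance to assign deploy2 happens while the resource is still held, which is the "stuck" state the description refers to.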

This MR solves this "stuck" situation by kicking off an AssignResourceFromResourceGroupWorker job for a resource group if:

there are no "free" resources yet, AND
there are still more upcoming processables/builds for that resource group

The idea is that for the next round of AssignResourceFromResourceGroupWorker, the resource would already be free and can be assigned to a build.
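The two conditions above, together with the feature flag, can be sketched as a single predicate. This is a minimal illustration, not the MR's code: FakeGroup, respawn?, and the flag_enabled parameter are hypothetical names chosen for the example.

```ruby
# Hypothetical names throughout; a sketch of the rule described above,
# not GitLab's implementation.
FakeGroup = Struct.new(:free_resources, :upcoming_builds)

# Schedule another worker run only when the flag is on, no resource is free yet,
# and builds are still waiting for this resource group.
def respawn?(group, flag_enabled:)
  flag_enabled && group.free_resources.zero? && group.upcoming_builds.positive?
end

busy_with_backlog = FakeGroup.new(0, 3) # nothing free, builds waiting
busy_no_backlog   = FakeGroup.new(0, 0) # nothing free, nothing waiting
free_with_backlog = FakeGroup.new(1, 3) # a resource is free, normal assignment applies

puts respawn?(busy_with_backlog, flag_enabled: true)  # true
puts respawn?(busy_no_backlog, flag_enabled: true)    # false
puts respawn?(free_with_backlog, flag_enabled: true)  # false
puts respawn?(busy_with_backlog, flag_enabled: false) # false
```

Only the first case schedules another run. In the other cases there is either nothing left to assign, a resource is already free, or the flag is disabled.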

This change is behind a Feature Flag. Rollout issue: [Feature flag] Rollout of `respawn_assign_resou... (#450793 - closed)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A

How to set up and validate locally

This is actually very hard to replicate locally and, so far, I've only observed this in this example group: https://gitlab.com/dkua1_ultimate_group/private/zd/gitlab-job-stuck-at-waiting-status-460524.

I would suggest just testing that there are no errors happening with this setup:

Create a project
Add a .gitlab-ci.yml and child .deploy.yml pipeline configuration (see examples below)

Run the pipeline several times with the Feature Flag respawn_assign_resource_worker enabled and disabled.

.gitlab-ci.yml

# note: the script here is just based on a user-reported issue with this particular problem,
#       where there is a job that changes the resource_group's process_mode during pipeline execution,
#       further exacerbating the possibility of the race condition happening
build:
  stage: build
  resource_group: "resource_group_1"
  script:
  - apk add --no-cache curl
  - |
    curl "https://gdk.test:3443/api/v4/projects/32/resource_groups/resource_group_1" \
      -k -X PUT \
      --header "Authorization: Bearer <the-personal-access-token>" \
      --header "Content-Type: application/json" \
      --data "{\"process_mode\": \"oldest_first\"}"

deploy:
  stage: deploy
  resource_group: "resource_group_1"
  trigger:
    include: ".deploy.yml"
    strategy: depend

.deploy.yml child pipeline configuration

deploy:
  stage: deploy
  script:
    - echo "DEPLOY"
  environment:
    name: production
    action: start

deploy2:
  stage: deploy
  script:
    - echo "DEPLOY2"
  environment:
    name: production2
    action: start

deploy3:
  stage: deploy
  script:
    - echo "DEPLOY3"
  environment:
    name: production3
    action: start
  
deploy4:
  stage: deploy
  script:
    - echo "DEPLOY4"
  environment:
    name: production4
    action: start

deploy5:
  stage: deploy
  script:
    - echo "DEPLOY5"
  environment:
    name: production5
    action: start

Related to #436988 (closed)

Edited Mar 27, 2024 by Pam Artiaga
