Re-spawn the AssignResource worker if busy

What does this MR do and why?

We previously encountered a problem where a job with a resource group got stuck due to a race condition. The cause was that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, can only run one at a time per resource group because it uses the until_executed deduplication strategy. This was resolved by adding an if_deduplicated: reschedule_once option to the AssignResourceFromResourceGroupWorker. (More details here: Pipeline job depends on Resource Group could be... (#342123 - closed).)
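To make the deduplication behaviour concrete, here is a toy model in plain Ruby. It is not GitLab's Sidekiq middleware: the `DedupQueue` class and its methods are hypothetical names invented for this illustration. It only models the idea described above, assuming that a duplicate is dropped while a job for the same resource group is pending or running, and that the reschedule-once option turns a dropped duplicate into exactly one follow-up run.

```ruby
# Toy model of "until executed" deduplication with a reschedule-once fallback.
# DedupQueue is a hypothetical class for illustration, not GitLab code.
class DedupQueue
  attr_reader :runs

  def initialize(reschedule_once:)
    @reschedule_once = reschedule_once
    @in_flight = {} # key => true while a job is pending or running
    @dropped = {}   # key => true if a duplicate arrived in the meantime
    @runs = []      # keys in the order their jobs finished
  end

  # Returns :enqueued or :deduplicated.
  def enqueue(key)
    if @in_flight[key]
      @dropped[key] = true
      return :deduplicated
    end

    @in_flight[key] = true
    :enqueued
  end

  # Marks the in-flight job for `key` as finished.
  # Returns true if a follow-up job was rescheduled.
  def finish(key)
    @runs << key
    @in_flight.delete(key)
    had_duplicate = @dropped.delete(key)

    return false unless @reschedule_once && had_duplicate

    enqueue(key)
    true
  end
end

queue = DedupQueue.new(reschedule_once: true)
queue.enqueue("resource_group_1") # first job starts
queue.enqueue("resource_group_1") # arrives while busy, so it is dropped
queue.finish("resource_group_1")  # first job ends, one follow-up is enqueued
```

In this model, a queue created with `reschedule_once: false` would simply forget the dropped duplicate, which is the kind of lost wake-up the earlier fix addressed.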

Now, it turns out that we are still running into race conditions for pipelines that run in parallel, or pipelines with multiple downstream/child pipelines that then run in parallel. Essentially, the AssignResourceFromResourceGroupWorker can stop assigning jobs to a specific resource group because it checks whether a resource is free just before that resource is actually freed.

This MR solves this "stuck" situation by kicking off an AssignResourceFromResourceGroupWorker job for a resource group if:

  • there are no "free" resources yet, AND
  • there are still more upcoming processables/builds for that resource group

The idea is that for the next round of AssignResourceFromResourceGroupWorker, the resource would already be free and can be assigned to a build.
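The respawn condition can be sketched with a small, self-contained Ruby simulation. This is not the code from this MR: `ResourceGroup`, `assign_round`, `release`, and `simulate` are hypothetical names, the group is assumed to hold a single resource, and the resource is assumed to be released before the respawned round runs.

```ruby
# Minimal model of one resource group with a single resource.
# All names are hypothetical; this is not GitLab's implementation.
ResourceGroup = Struct.new(:free_resources, :upcoming_builds, :running_builds)

# One round of the assign worker. Returns true when another round should be
# scheduled: no free resources yet AND builds are still waiting.
def assign_round(group, respawn_enabled:)
  while group.free_resources > 0 && !group.upcoming_builds.empty?
    group.free_resources -= 1
    group.running_builds << group.upcoming_builds.shift
  end

  respawn_enabled &&
    group.free_resources.zero? &&
    !group.upcoming_builds.empty?
end

# A finished build gives its resource back to the group.
def release(group, build)
  group.running_builds.delete(build)
  group.free_resources += 1
end

# Replays the problematic ordering: the worker checks for a free resource
# before the running build has released it.
def simulate(respawn_enabled:)
  group = ResourceGroup.new(0, ["deploy2"], ["deploy"])
  respawn = assign_round(group, respawn_enabled: respawn_enabled)
  release(group, "deploy")
  assign_round(group, respawn_enabled: respawn_enabled) if respawn
  group
end

stuck = simulate(respawn_enabled: false)
fixed = simulate(respawn_enabled: true)
puts "flag off: still waiting = #{stuck.upcoming_builds.inspect}"
puts "flag on:  running       = #{fixed.running_builds.inspect}"
```

With the flag off, `deploy2` stays in the waiting list even though the resource is free, because nothing triggers another round. With the flag on, the extra round picks it up.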

This change is behind a Feature Flag. Rollout issue: [Feature flag] Rollout of `respawn_assign_resou... (#450793)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A

How to set up and validate locally

This is actually very hard to replicate locally and, so far, I've only observed this in this example group: https://gitlab.com/dkua1_ultimate_group/private/zd/gitlab-job-stuck-at-waiting-status-460524.

I would suggest just testing that there are no errors happening with this setup:

  1. Create a project

  2. Add a .gitlab-ci.yml and child .deploy.yml pipeline configuration (see examples below)

  3. Run the pipeline several times with the Feature Flag respawn_assign_resource_worker enabled and disabled.

.gitlab-ci.yml

    # note: the script here is just based on a user-reported issue with this particular problem,
    #       where there is a job that changes the resource_group's process_mode during pipeline execution,
    #       further exacerbating the possibility of the race condition happening
    build:
      stage: build
      resource_group: "resource_group_1"
      script:
      - apk add --no-cache curl
      - |
        curl "https://gdk.test:3443/api/v4/projects/32/resource_groups/resource_group_1" \
          -k -X PUT \
          --header "Authorization: Bearer <the-personal-access-token>" \
          --header "Content-Type: application/json" \
          --data "{\"process_mode\": \"oldest_first\"}"
    
    deploy:
      stage: deploy
      resource_group: "resource_group_1"
      trigger:
        include: ".deploy.yml"
        strategy: depend

.deploy.yml child pipeline configuration

    deploy:
      stage: deploy
      script:
        - echo "DEPLOY"
      environment:
        name: production
        action: start
    
    deploy2:
      stage: deploy
      script:
        - echo "DEPLOY2"
      environment:
        name: production2
        action: start
    
    deploy3:
      stage: deploy
      script:
        - echo "DEPLOY3"
      environment:
        name: production3
        action: start
      
    deploy4:
      stage: deploy
      script:
        - echo "DEPLOY4"
      environment:
        name: production4
        action: start
    
    deploy5:
      stage: deploy
      script:
        - echo "DEPLOY5"
      environment:
        name: production5
        action: start

Related to #436988

Edited by Pam Artiaga
