
resource_group deadlock - child pipeline disappeared, did not release lock

I'm using GitLab SaaS (gitlab.com) with a self-hosted runner, version 18.4.0, on a local server.

For anyone with permission to view it, the affected pipeline is: https://gitlab.com/sqc-eng/esys/mono/-/pipelines/2100054628

This child pipeline takes a resource_group as a lock, so that only one instance of the child pipeline can run at a time.

This pipeline started over 12 hours ago and is now stuck in a "running" state. The job qps-test-warm-boot-job completed successfully, but the next job, integration-test-job, is stuck in a "created" state and never runs.

The GraphQL query for this job returns:

{
  "data": {
    "project": {
      "name": "Mono",
      "job": {
        "active": false,
        "name": "integration-test-job",
        "createdAt": "2025-10-15T04:57:39Z",
        "duration": null,
        "queuedDuration": null,
        "queuedAt": null,
        "exitCode": null,
        "finishedAt": null,
        "kind": "BUILD",
        "pipeline": {
          "id": "gid://gitlab/Ci::Pipeline/2100116387"
        },
        "runner": null,
        "status": "CREATED",
        "stuck": false,
        "detailedStatus": {
          "action": {
            "path": "/sqc-eng/esys/mono/-/jobs/11720010564/cancel",
            "buttonTitle": "Cancel this job"
          }
        }
      }
    }
  },
  "correlationId": "98fb3904d06d725f-AKL"
}
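
For anyone who wants to run a similar check, a query along these lines should return a response shaped like the one above. This is a minimal sketch, not the exact query I ran: the field names are taken from the response shown, while the `gid://gitlab/Ci::Build/<id>` global ID format, the helper names, and the idea of passing a personal access token as a Bearer token are assumptions on my part.

```python
import json
import urllib.request

GRAPHQL_URL = "https://gitlab.com/api/graphql"

# Selects a subset of the fields that appear in the response above.
JOB_QUERY = """
query($fullPath: ID!, $jobId: JobID!) {
  project(fullPath: $fullPath) {
    name
    job(id: $jobId) {
      active
      name
      createdAt
      queuedAt
      finishedAt
      status
      stuck
      pipeline { id }
      detailedStatus { action { path buttonTitle } }
    }
  }
}
"""


def build_job_payload(full_path: str, job_id: int) -> dict:
    """Build the GraphQL request body for one job (hypothetical helper)."""
    return {
        "query": JOB_QUERY,
        "variables": {
            "fullPath": full_path,
            # Assumption: jobs are addressed by a Ci::Build global ID.
            "jobId": f"gid://gitlab/Ci::Build/{job_id}",
        },
    }


def run_query(payload: dict, token: str) -> dict:
    """POST the payload to the GraphQL endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_query(build_job_payload("sqc-eng/esys/mono", 11720010564), token)` with a token that can read the project would perform the lookup; the job ID comes from the cancel path in the response above.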

The query for the child pipeline this job belongs to returns null: the child pipeline has disappeared from the API. However, the parent pipeline is still running, waiting for a child pipeline that no longer exists and will never complete:

{
  "data": {
    "project": {
      "name": "Mono",
      "pipeline": {
        "iid": "33628",
        "type": "merged_result",
        "active": true,
        "child": false,
        "path": "/sqc-eng/esys/mono/-/pipelines/2100054628",
        "complete": false,
        "createdAt": "2025-10-15T03:59:55Z",
        "updatedAt": "2025-10-15T04:56:49Z",
        "detailedStatus": {
          "id": "running-2100054628-2100054628"
        },
        "duration": null,
        "errorMessages": {
          "edges": []
        },
        "failureReason": null,
        "queuedDuration": 6,
        "status": "RUNNING",
        "stuck": false
      }
    }
  },
  "correlationId": "98fb3dbae035725f-ATL"
}
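
The condition described by these two responses can be checked mechanically. The sketch below is only an illustration: the function names are hypothetical, and it assumes the child lookup comes back with `"pipeline": null` under `project`, which is how I read the null result.

```python
from typing import Optional


def _pipeline(response: dict) -> Optional[dict]:
    """Return the pipeline object from a GraphQL response, or None if absent."""
    project = (response.get("data") or {}).get("project") or {}
    return project.get("pipeline")


def parent_waits_on_missing_child(parent_resp: dict, child_resp: dict) -> bool:
    """True when the parent is still RUNNING but the child lookup returned null."""
    parent = _pipeline(parent_resp)
    child = _pipeline(child_resp)
    if parent is None:
        return False
    still_running = parent.get("status") == "RUNNING" and not parent.get("complete", False)
    return bool(still_running and child is None)
```

Fed the parent response above and a child response whose `pipeline` is null, this returns `True`, which is the state that never resolves on its own.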

I've checked the resource_group via the REST API: it exists and is currently held by the parent pipeline. I assume the lock is taken by the parent pipeline's bridge job just before it spawns the child pipeline, so this makes sense.
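
A rough sketch of that kind of REST check follows, using the resource group endpoints under `/projects/:id/resource_groups/:key`. The resource group key `integration-test` is a placeholder (I have not given the real key here), and the helper names and the `PRIVATE-TOKEN` usage are my own assumptions about how someone might script it.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://gitlab.com/api/v4"


def resource_group_url(project_path: str, key: str, suffix: str = "") -> str:
    """Build the REST URL for a resource group, optionally with a sub-resource."""
    # Project paths and keys must be URL-encoded, including any slashes.
    project = quote(project_path, safe="")
    group = quote(key, safe="")
    url = f"{API_ROOT}/projects/{project}/resource_groups/{group}"
    return f"{url}/{suffix}" if suffix else url


def fetch_json(url: str, token: str):
    """GET a REST endpoint with a personal access token and parse the JSON."""
    request = urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (placeholder key, not executed here):
#   fetch_json(resource_group_url("sqc-eng/esys/mono", "integration-test"), token)
#   fetch_json(resource_group_url("sqc-eng/esys/mono", "integration-test",
#                                 "upcoming_jobs"), token)
```

The `upcoming_jobs` sub-resource lists the jobs queued behind the lock, which is a quick way to see how many pipelines are blocked.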

The system is now deadlocked! The resource_group lock is held by a parent pipeline whose child pipeline has disappeared and will never complete, so the parent's bridge job waits forever and the lock is never released. I now have ~50 overnight jobs waiting indefinitely to spawn their own child pipelines.

I will have to restart everything to keep people working, but maybe this information helps if others see similar behaviour.
