Jobs hanging when a child pipeline job depends on a parent pipeline job

Summary

We have a parent-child pipeline set up in one of our larger projects, and recently one of our engineers added a dependency from a job in the child pipeline onto a job in the parent pipeline via the needs:pipeline:job keyword. This caused the child job to hang in the GitLab UI after about 5 minutes of processing.

We mostly run Kubernetes executors, and we could see that the job pod was terminated at the time the logs stopped updating in the GitLab UI. The kubelet logs showed a normal pod termination, not an error state.
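
For anyone trying to reproduce this, something like the following can be used to check the build pod's termination. The namespace and pod name here are illustrative (the Kubernetes executor names build pods runner-<token>-project-<id>-concurrent-<n>):

# Events for the build pod around the time the trace stopped updating
kubectl -n gitlab-runner get events \
  --field-selector involvedObject.name=runner-xxkrmkp3s-project-59934890-concurrent-0 \
  --sort-by=.lastTimestamp

# Pod state at termination; in our case this was a normal deletion,
# not OOMKilled or an Error state
kubectl -n gitlab-runner describe pod runner-xxkrmkp3s-project-59934890-concurrent-0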

Steps to reproduce

  • Create a parent-child pipeline
  • Set a job in the child pipeline to depend on a job in the parent pipeline via needs:pipeline:job
  • Ensure that the child job runs for more than 5 minutes
.gitlab-ci.yml
# .gitlab-ci.yml
stages: [prepare, trigger]

artifact-job:
  image: alpine:3.21
  stage: prepare
  script:
    - echo "test" > file.txt
  artifacts:
    paths:
      - file.txt

trigger-child-pipeline:
  stage: trigger
  interruptible: true
  trigger:
    strategy: depend
    include: child.yml
  variables:
    PARENT_PIPELINE_ID: $CI_PIPELINE_ID
# child.yml
stages: [simple]

child-sleep-job:
  image: alpine:3.21
  stage: simple
  script:
    - apk add bash
    - chmod +x ${CI_PROJECT_DIR}/sleep.sh
    - bash ${CI_PROJECT_DIR}/sleep.sh
  interruptible: true
  needs:
    - pipeline: $PARENT_PIPELINE_ID
      job: artifact-job

The sleep.sh script simply sleeps for 10 minutes, printing a log line every 5 seconds:

#!/bin/bash

for i in {1..120}; do
    echo "Waiting... $((i * 5))s elapsed"
    sleep 5
done

Actual behavior

The child job's logs stop updating after ~5 minutes because the job has been terminated on the runner, but the job continues showing "In Progress" in the GitLab UI until it is cancelled or times out.

Expected behavior

The child job should complete successfully.

Relevant logs and/or screenshots

I pulled this log from the runner-manager pod. It shows that the runner does attempt to tell GitLab that the job has failed, but receives a 403. The 403 responses suggest the job token had already been revoked server-side, which is consistent with the "Job failed: canceled" line below.

job log
2025-06-09T04:04:32.560595368Z Appending trace to coordinator...ok                 code=202 job=10289549944 job-log=0-8959 job-status=running runner=__KRmkp3S sent-log=8903-8958 status=202 Accepted update-interval=3s
2025-06-09T04:04:36.082587213Z Appending trace to coordinator...ok                 code=202 job=10289549944 job-log=0-9015 job-status=running runner=__KRmkp3S sent-log=8959-9014 status=202 Accepted update-interval=3s
2025-06-09T04:04:42.511663949Z WARNING: Appending trace to coordinator... job failed  code=403 job=10289549944 job-log= job-status= runner=__KRmkp3S sent-log=9015-9070 status=403 Forbidden update-interval=0s
2025-06-09T04:04:42.511915152Z WARNING: after_script failed, but job will continue unaffected: context canceled  job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:42.511926979Z WARNING: Error while executing file based variables removal script  error=context canceled job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:42.537898926Z WARNING: Job failed: canceled
2025-06-09T04:04:42.537926754Z                       duration_s=361.375103085 job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:43.011023057Z WARNING: Appending trace to coordinator... job failed  code=403 job=10289549944 job-log= job-status= runner=__KRmkp3S sent-log=9015-9372 status=403 Forbidden update-interval=0s
2025-06-09T04:04:43.011064897Z Updating job...                                     bytesize=9373 checksum=crc32:ff042edd job=10289549944 runner=__KRmkp3S
2025-06-09T04:04:43.877704676Z WARNING: Submitting job to coordinator... job failed  bytesize=9373 checksum=crc32:ff042edd code=403 job=10289549944 job-status= runner=__KRmkp3S status=PUT https://gitlab.com/api/v4/jobs/10289549944: 403 Forbidden update-interval=0s
2025-06-09T04:04:43.877905346Z Removed job from processing list                    builds=0 job=10289549944 max_builds=100 project=59934890 repo_url=https://gitlab.com/REDACTED/test-project.git time_in_queue_seconds=2

Environment description

We have reproduced this issue on our Kubernetes-executor runners, on EC2 runners using the Docker executor with autoscaling fleeting, and on the GitLab SaaS runners.

In addition, the issue was reproducible on the following runner Helm chart versions:

  • 0.77.2
  • 0.76.2
  • 0.75.1
  • 0.74.3
  • 0.73.5
values.yaml contents
image:
  registry: registry.gitlab.com
  image: gitlab-org/gitlab-runner
  tag: alpine-v{{.Chart.AppVersion}}

imagePullPolicy: IfNotPresent
replicas: 1
revisionHistoryLimit: 1
gitlabUrl: https://gitlab.com/

unregisterRunners: true
terminationGracePeriodSeconds: 3600
concurrent: 100
shutdown_timeout: 0
checkInterval: 3
logLevel: info

sessionServer:
  enabled: false

rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["events"]
      verbs:
        - "list"
        - "watch" # Required when FF_PRINT_POD_EVENTS=true
    - apiGroups: [""]
      resources: ["namespaces"]
      verbs:
        - "create" # Required when kubernetes.NamespacePerJob=true
        - "delete" # Required when kubernetes.NamespacePerJob=true
    - apiGroups: [""]
      resources: ["pods"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "list" # Required when FF_USE_INFORMERS=true
        - "watch" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_INFORMERS=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["pods/attach"]
      verbs:
        - "create" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "delete" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "get" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "patch" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["pods/exec"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "patch"
    - apiGroups: [""]
      resources: ["pods/log"]
      verbs:
        - "get" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false, FF_WAIT_FOR_POD_TO_BE_REACHABLE=true
        - "list" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["secrets"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "update"
    - apiGroups: [""]
      resources: ["serviceaccounts"]
      verbs:
        - "get"
    - apiGroups: [""]
      resources: ["services"]
      verbs:
        - "create"
        - "get"

  clusterWideAccess: false
  podSecurityPolicy:
    enabled: false
    resourceNames:
      - gitlab-runner

serviceAccount:
  create: false
  name: "gitlab-runner"

metrics:
  enabled: true
  portName: metrics
  port: 9252
  serviceMonitor:
    enabled: true

service:
  enabled: true
  type: ClusterIP

runners:
  executor: kubernetes
  name: "{{.Release.Name}}"
  secret: "{{.Release.Name}}-secret"

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  runAsNonRoot: true
  privileged: false
  capabilities:
    drop: ["ALL"]
    # allow: ["ALL"]

podSecurityContext:
  runAsUser: 100
  # runAsGroup: 65533
  fsGroup: 65533
  # supplementalGroups: [65533]

resources: {}
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist

topologySpreadConstraints: {}
nodeSelector: {}
tolerations:
  - key: "node-role.kubernetes.io/worker"
    operator: "Exists"
config.toml contents
[[runners]]
  name = "gitlab-runner-debug"
  environment = ["FF_ENABLE_JOB_CLEANUP=True"\,"FF_TIMESTAMPS=True"\,"DOCKER_HOST=tcp://docker:2376"\,"DOCKER_TLS_CERTDIR=/certs"\,"DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client"\,"DOCKER_TLS_VERIFY=1"]
  [runners.kubernetes]
    namespace = "{{.Release.Namespace}}"
    image = "alpine:3.21"
    tls_verify = true
    privileged = true
    services_privileged = true
    cpu_request = "1"
    memory_limit = "4Gi"
    helper_cpu_limit = "1"
    helper_memory_limit = "1Gi"
    service_cpu_limit = "1"
    service_memory_limit = "2Gi"
  [[runners.kubernetes.volumes.empty_dir]]
    name = "docker-certs"
    mount_path = "/certs/client"
    medium = "Memory"
  [runners.kubernetes.pod_annotations]
    "karpenter.sh/do-not-disrupt" = "true"
    "prometheus.io/scrape" = "true"
  [runners.kubernetes.node_selector]
    "karpenter.sh/nodepool" = "default"
  [runners.custom_build_dir]
    enabled = true
  [runners.cache]
    Type = "s3"
    Path = "gitlab-runner-cache"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.ap-southeast-2.amazonaws.com"
      BucketLocation = "ap-southeast-2"
      BucketName = "${GITLAB_CACHE_BUCKET}"
      Insecure = false
      ServerSideEncryption = "S3"

Used GitLab Runner version

  • Running with gitlab-runner 17.8.5 (c9164c8c)
  • Running with gitlab-runner 17.9.3 (f6927248)
  • Running with gitlab-runner 17.10.1 (ef334dcc)
  • Running with gitlab-runner 17.11.2 (db89d497)
  • Running with gitlab-runner 18.0.2 (4d7093e1)

Possible fixes
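
One possible workaround, sketched below but not verified, is to drop the needs:pipeline:job dependency and fetch the parent job's artifacts through the REST API instead. In this repro the parent's stage ordering already guarantees artifact-job has finished before the child pipeline starts, so the needs entry is only providing the artifact download. This is an assumption-laden sketch: the pipeline-jobs listing endpoint may not accept CI_JOB_TOKEN, in which case a token with read_api scope would be needed for that first call.

# child.yml — workaround sketch (untested): fetch the parent job's artifacts
# via the REST API instead of needs:pipeline:job
child-sleep-job:
  image: alpine:3.21
  stage: simple
  script:
    - apk add bash curl jq unzip
    # Look up the parent job's ID by name. CI_JOB_TOKEN may not be accepted
    # on this listing endpoint; use a read_api token here if it returns 401/403.
    - >
      JOB_ID=$(curl --silent --header "JOB-TOKEN: $CI_JOB_TOKEN"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/pipelines/$PARENT_PIPELINE_ID/jobs"
      | jq -r '.[] | select(.name == "artifact-job") | .id')
    # Download and unpack that job's artifacts archive (this endpoint does
    # accept CI_JOB_TOKEN)
    - >
      curl --silent --header "JOB-TOKEN: $CI_JOB_TOKEN" --output artifacts.zip
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/jobs/$JOB_ID/artifacts"
    - unzip artifacts.zip
    - bash ${CI_PROJECT_DIR}/sleep.sh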
