Jobs hanging when a child pipeline job depends on a parent pipeline job
Summary
We have a parent-child pipeline set up in one of our larger projects. Recently, one of our engineers added a dependency from a job in a child pipeline onto a job in the parent pipeline via the needs:pipeline:job keyword. This caused the child job to hang in the GitLab UI after about 5 minutes of processing.
We mostly run Kubernetes executors, and we could see that the job pod was being terminated at the moment the logs stopped moving in the GitLab UI. The kubelet logs showed a normal pod termination, not an error state.
Steps to reproduce
- Create a parent-child pipeline
- Set a job in the child pipeline to depend on a job in the parent pipeline via needs:pipeline:job
- Ensure that the child job runs for more than 5 minutes
.gitlab-ci.yml
# .gitlab-ci.yml
stages: [prepare, trigger]

artifact-job:
  image: alpine:3.21
  stage: prepare
  script:
    - echo "test" > file.txt
  artifacts:
    paths:
      - file.txt

trigger-child-pipeline:
  stage: trigger
  interruptible: true
  trigger:
    strategy: depend
    include: child.yml
  variables:
    PARENT_PIPELINE_ID: $CI_PIPELINE_ID
# child.yml
stages: [simple]

child-sleep-job:
  image: alpine:3.21
  stage: simple
  script:
    - apk add bash
    - chmod +x ${CI_PROJECT_DIR}/sleep.sh
    - bash -c ${CI_PROJECT_DIR}/sleep.sh
  interruptible: true
  needs:
    - pipeline: $PARENT_PIPELINE_ID
      job: artifact-job
The sleep.sh script simply sleeps for 10 minutes, printing a log line every 5 seconds:
#!/bin/bash
for i in {1..120}; do
  echo "Waiting... $((i * 5))s elapsed"
  sleep 5
done
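To narrow down the point at which the job is cancelled, a parameterized variant of the script may help. This is a minimal sketch and not part of the original reproduction; the function name wait_with_heartbeat and its two arguments (total seconds, interval seconds) are hypothetical.

```bash
#!/bin/bash
# Hypothetical variant of sleep.sh: total duration and heartbeat interval
# are passed as arguments so the cancellation threshold can be bisected.
wait_with_heartbeat() {
  local total="$1"      # total run time in seconds
  local interval="$2"   # seconds between log lines
  local elapsed=0
  while [ "$elapsed" -lt "$total" ]; do
    sleep "$interval"
    elapsed=$((elapsed + interval))
    echo "Waiting... ${elapsed}s elapsed"
  done
}
```

Calling `wait_with_heartbeat 600 5` roughly matches the original script (10 minutes, one line every 5 seconds); shorter totals such as 240 or 330 could show whether the cancellation is tied to a fixed time limit.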
Actual behavior
The child job's log stops updating after ~5 minutes because the job has been terminated on the runner, but the job continues to show "In Progress" in the GitLab UI until it is cancelled or times out.
Expected behavior
The child job should complete successfully.
Relevant logs and/or screenshots
I pulled this log from the runner-manager pod. It shows that the runner does attempt to tell GitLab that the job has failed, but gets a 403.
job log
2025-06-09T04:04:32.560595368Z Appending trace to coordinator...ok code=202 job=10289549944 job-log=0-8959 job-status=running runner=__KRmkp3S sent-log=8903-8958 status=202 Accepted update-interval=3s
2025-06-09T04:04:36.082587213Z Appending trace to coordinator...ok code=202 job=10289549944 job-log=0-9015 job-status=running runner=__KRmkp3S sent-log=8959-9014 status=202 Accepted update-interval=3s
2025-06-09T04:04:42.511663949Z WARNING: Appending trace to coordinator... job failed code=403 job=10289549944 job-log= job-status= runner=__KRmkp3S sent-log=9015-9070 status=403 Forbidden update-interval=0s
2025-06-09T04:04:42.511915152Z WARNING: after_script failed, but job will continue unaffected: context canceled job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:42.511926979Z WARNING: Error while executing file based variables removal script error=context canceled job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:42.537898926Z WARNING: Job failed: canceled
2025-06-09T04:04:42.537926754Z duration_s=361.375103085 job=10289549944 project=59934890 runner=__KRmkp3S
2025-06-09T04:04:43.011023057Z WARNING: Appending trace to coordinator... job failed code=403 job=10289549944 job-log= job-status= runner=__KRmkp3S sent-log=9015-9372 status=403 Forbidden update-interval=0s
2025-06-09T04:04:43.011064897Z Updating job... bytesize=9373 checksum=crc32:ff042edd job=10289549944 runner=__KRmkp3S
2025-06-09T04:04:43.877704676Z WARNING: Submitting job to coordinator... job failed bytesize=9373 checksum=crc32:ff042edd code=403 job=10289549944 job-status= runner=__KRmkp3S status=PUT https://gitlab.com/api/v4/jobs/10289549944: 403 Forbidden update-interval=0s
2025-06-09T04:04:43.877905346Z Removed job from processing list builds=0 job=10289549944 max_builds=100 project=59934890 repo_url=https://gitlab.com/REDACTED/test-project.git time_in_queue_seconds=2
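The switch from 202 to 403 responses is the interesting moment in this log. A small filter like the following could locate it in a longer runner-manager log. It is only a sketch: the function name trace_transition is hypothetical, and it assumes every line starts with a timestamp followed by key=value fields, as in the excerpt above.

```bash
#!/bin/bash
# Print the last accepted (202) and the first rejected (403) update for one job.
# Reads runner-manager log lines on stdin; the job ID is the first argument.
trace_transition() {
  awk -v job="job=$1" '
    index($0, job) == 0 { next }                  # ignore lines for other jobs
    seen403 { next }                              # only the first 403 matters
    /code=202/ { accepted = $1 }                  # latest accepted update so far
    /code=403/ { rejected = $1; seen403 = 1 }     # first rejected update
    END {
      print "last_accepted=" accepted
      print "first_rejected=" rejected
    }
  '
}
```

Fed the excerpt above with job ID 10289549944, this reports the last accepted update at 04:04:36 and the first rejected one at 04:04:42, which narrows the window in which the coordinator stopped accepting updates for the job to about six seconds.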
Environment description
We have reproduced this issue on our Kubernetes executor runners, on some EC2 runners using the Docker executor and fleeting autoscaling, and on the GitLab SaaS runners.
In addition, the issue was reproducible on the following runner Helm chart versions:
0.77.2
0.76.2
0.75.1
0.74.3
0.73.5
values.yaml contents
image:
  registry: registry.gitlab.com
  image: gitlab-org/gitlab-runner
  tag: alpine-v{{.Chart.AppVersion}}
imagePullPolicy: IfNotPresent
replicas: 1
revisionHistoryLimit: 1
gitlabUrl: https://gitlab.com/
unregisterRunners: true
terminationGracePeriodSeconds: 3600
concurrent: 100
shutdown_timeout: 0
checkInterval: 3
logLevel: info
sessionServer:
  enabled: false
rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["events"]
      verbs:
        - "list"
        - "watch" # Required when FF_PRINT_POD_EVENTS=true
    - apiGroups: [""]
      resources: ["namespaces"]
      verbs:
        - "create" # Required when kubernetes.NamespacePerJob=true
        - "delete" # Required when kubernetes.NamespacePerJob=true
    - apiGroups: [""]
      resources: ["pods"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "list" # Required when FF_USE_INFORMERS=true
        - "watch" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_INFORMERS=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["pods/attach"]
      verbs:
        - "create" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "delete" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "get" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
        - "patch" # Required when FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["pods/exec"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "patch"
    - apiGroups: [""]
      resources: ["pods/log"]
      verbs:
        - "get" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false, FF_WAIT_FOR_POD_TO_BE_REACHABLE=true
        - "list" # Required when FF_KUBERNETES_HONOR_ENTRYPOINT=true, FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    - apiGroups: [""]
      resources: ["secrets"]
      verbs:
        - "create"
        - "delete"
        - "get"
        - "update"
    - apiGroups: [""]
      resources: ["serviceaccounts"]
      verbs:
        - "get"
    - apiGroups: [""]
      resources: ["services"]
      verbs:
        - "create"
        - "get"
  clusterWideAccess: false
  podSecurityPolicy:
    enabled: false
    resourceNames:
      - gitlab-runner
serviceAccount:
  create: false
  name: "gitlab-runner"
metrics:
  enabled: true
  portName: metrics
  port: 9252
  serviceMonitor:
    enabled: true
service:
  enabled: true
  type: ClusterIP
runners:
  executor: kubernetes
  name: "{{.Release.Name}}"
  secret: "{{.Release.Name}}-secret"
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  runAsNonRoot: true
  privileged: false
  capabilities:
    drop: ["ALL"]
    # allow: ["ALL"]
podSecurityContext:
  runAsUser: 100
  # runAsGroup: 65533
  fsGroup: 65533
  # supplementalGroups: [65533]
resources: {}
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
topologySpreadConstraints: {}
nodeSelector: {}
tolerations:
  - key: "node-role.kubernetes.io/worker"
    operator: "Exists"
config.toml contents
[[runners]]
  name = "gitlab-runner-debug"
  environment = ["FF_ENABLE_JOB_CLEANUP=True", "FF_TIMESTAMPS=True", "DOCKER_HOST=tcp://docker:2376", "DOCKER_TLS_CERTDIR=/certs", "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client", "DOCKER_TLS_VERIFY=1"]
  [runners.kubernetes]
    namespace = "{{.Release.Namespace}}"
    image = "alpine:3.21"
    tls_verify = true
    privileged = true
    services_privileged = true
    cpu_request = "1"
    memory_limit = "4Gi"
    helper_cpu_limit = "1"
    helper_memory_limit = "1Gi"
    service_cpu_limit = "1"
    service_memory_limit = "2Gi"
    [[runners.kubernetes.volumes.empty_dir]]
      name = "docker-certs"
      mount_path = "/certs/client"
      medium = "Memory"
    [runners.kubernetes.pod_annotations]
      "karpenter.sh/do-not-disrupt" = "true"
      "prometheus.io/scrape" = "true"
    [runners.kubernetes.node_selector]
      "karpenter.sh/nodepool" = "default"
  [runners.custom_build_dir]
    enabled = true
  [runners.cache]
    Type = "s3"
    Path = "gitlab-runner-cache"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.ap-southeast-2.amazonaws.com"
      BucketLocation = "ap-southeast-2"
      BucketName = "${GITLAB_CACHE_BUCKET}"
      Insecure = false
      ServerSideEncryption = "S3"
Used GitLab Runner version
Running with gitlab-runner 17.8.5 (c9164c8c)
Running with gitlab-runner 17.9.3 (f6927248)
Running with gitlab-runner 17.10.1 (ef334dcc)
Running with gitlab-runner 17.11.2 (db89d497)
Running with gitlab-runner 18.0.2 (4d7093e1)