When gitlab-runner is stopped, rescheduled, or crashes, child jobs hang indefinitely

Summary

When using the Kubernetes executor:

If the controller pod stops (it is deleted, crashes, or is rescheduled to a different machine by Kubernetes), any jobs it was running at the time hang indefinitely, blocked reading from stdin.

Steps to reproduce

These steps assume a running Kubernetes cluster, GitLab CI runner, and a project on which you can start pipeline runs.

The cluster

Notice that the controller pod has been running for a while and has just started a new job:

$ kubectl get -n gitlab-ci pods
NAME                                            READY   STATUS    RESTARTS   AGE
gitlab-runner-d799dd6d4-wkrbl                   1/1     Running   0          4d20h
runner-mxszx5s-project-5674-concurrent-0mzhr5   6/6     Running   0          12s

Delete the controller pod

$ kubectl delete -n gitlab-ci pods gitlab-runner-d799dd6d4-wkrbl
pod "gitlab-runner-d799dd6d4-wkrbl" deleted

Check the state of the job

Notice that the job pod still exists after the old controller pod was deleted. Also notice that the Kubernetes ReplicaSet / Deployment has recreated the controller pod under a new name.

$ kubectl get -n gitlab-ci pods
NAME                                            READY   STATUS    RESTARTS   AGE
gitlab-runner-d799dd6d4-7984x                   1/1     Running   0          36s
runner-mxszx5s-project-5674-concurrent-0mzhr5   6/6     Running   0          56s

Next, we attach to the job pod and try to find out what it's doing.

$ kubectl exec -ti -n gitlab-ci runner-mxszx5s-project-5674-concurrent-0mzhr5 /bin/bash

bash-4.2# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  11700  2580 ?        Ss   14:28   0:00 /usr/bin/bash
root       507  0.0  0.0  11836  3036 pts/0    Ss   14:30   0:00 /bin/bash
root       624  0.0  0.0  51760  3604 pts/0    R+   14:38   0:00 ps aux
bash-4.2# strace -p 1
strace: Process 1 attached
read(0,

As the strace output shows, PID 1 in the job pod is blocked reading from stdin (file descriptor 0), and that read never completes.

Actual behavior

The job pod keeps running, blocked on a read from stdin, until it is killed manually.

Expected behavior

The job pod should be deleted whenever the controller pod is deleted. This can be accomplished by using the ownerReferences feature of Kubernetes when the controller creates new pods.

Example of where this could be set: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/59abc3d324882816618e7372dd85f6adb4d6c6b3/executors/kubernetes/kubernetes.go#L872

Environment description

This is a custom installation using the Kubernetes executor.

config.toml contents

    concurrent = 20
    check_interval = 30
    log_level = "info"
    listen_address = '[::]:9252'

Used GitLab Runner version

bash-4.4$ gitlab-runner --version
Version:      12.2.0
Git revision: a987417a
Git branch:   12-2-stable
GO version:   go1.8.7
Built:        2019-08-22T13:06:00+0000
OS/Arch:      linux/amd64

Possible fixes

This could be fixed by using the ownerReferences feature of Kubernetes when the controller creates new pods.

Example of where this could be set: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/59abc3d324882816618e7372dd85f6adb4d6c6b3/executors/kubernetes/kubernetes.go#L872

https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/

Edited Apr 13, 2020 by Nick Pillitteri