Temporarily unreachable Kubernetes master leads to orphaned job pods

Summary

If the Kubernetes master is unreachable for a short time (about 30 seconds), running jobs terminate with the error below. In some cases (probably depending on the job itself and/or the state it is in) this leaves behind an orphaned job pod that is never cleaned up, because the cleanup request fails against the unreachable master as well.

It would probably make sense to retry failed API requests with some kind of exponential backoff instead of failing the job on the first error. At least in GKE environments it is not uncommon for the Kubernetes master to be rebooted (node pools added/removed, config changes, master upgrades, etc.).
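To illustrate the kind of retry meant here, below is a minimal sketch in Python. It is not GitLab Runner code: the function name `retry_with_backoff`, its parameters, and the default values are assumptions chosen for illustration, and `flaky_request` is a hypothetical stand-in for a pod status or pod delete request. With the assumed defaults the waits are 1, 2, 4, 8 and 16 seconds, so a master outage of roughly 30 seconds would be ridden out before the job is failed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on every attempt (1s, 2s, 4s, ...), capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise ValueError("max_attempts must be at least 1")


# Hypothetical stand-in for an API request: refused twice, then succeeds.
calls = {"n": 0}


def flaky_request() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("connection refused")
    return "ok"


# Record the delays instead of really sleeping, so the example runs instantly.
delays = []
result = retry_with_backoff(flaky_request, sleep=delays.append)
```

The same wrapper would have to cover the cleanup (pod delete) request too; otherwise a job that fails for another reason during the outage could still leave its pod behind.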

Steps to reproduce

  • Start a long-running job
  • Shut down or cut off the Kubernetes master while the job is running

Relevant logs and/or screenshots

Running with gitlab-runner 10.8.0 (079aad9e)
  on Kubernetes Runner 7974b466
Using Kubernetes namespace: gitlab-runner-jobs
Using Kubernetes executor with image docker:1.11 ...
Waiting for pod gitlab-runner-jobs/runner-7974b466-project-468-concurrent-0f7hwj to be running, status is Pending
Running on runner-7974b466-project-468-concurrent-0f7hwj via gitlab-runner-75b464d954-b76db...
Cloning repository...
...
...
ERROR: Job failed (system failure): Get https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused
gitlab-runner-75b464d954-b76db gitlab-runner 2018-05-25T09:14:33.566800868Z ERROR: Job failed (system failure): Get https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused  job=159923 project=468 runner=7974b466
gitlab-runner-75b464d954-b76db gitlab-runner 2018-05-25T09:14:34.907593939Z ERROR: Error cleaning up pod: Delete https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused  job=159923 project=468 runner=7974b466

Environment description

Runners are on GKE Kubernetes 1.8.10

Used GitLab Runner version

Running with gitlab-runner 10.8.0 (079aad9e)

Kubernetes executor.