Temporarily unreachable Kubernetes master leads to orphaned job pods
Summary
If the Kubernetes master is unreachable for a short time (around 30 seconds), running jobs terminate with the error below. In some cases (probably depending on the job itself and/or the state it is in), this leaves behind an orphaned job pod that is never cleaned up.
It might be worth adding some kind of retry with exponential backoff. At least in GKE environments, it is not uncommon for the Kubernetes master to be rebooted (node pools added or removed, config changes, master upgrades, etc.).
Steps to reproduce
- Start a (longer-running) job
- Shut down or cut off the Kubernetes master
Relevant logs and/or screenshots
Running with gitlab-runner 10.8.0 (079aad9e)
on Kubernetes Runner 7974b466
Using Kubernetes namespace: gitlab-runner-jobs
Using Kubernetes executor with image docker:1.11 ...
Waiting for pod gitlab-runner-jobs/runner-7974b466-project-468-concurrent-0f7hwj to be running, status is Pending
Running on runner-7974b466-project-468-concurrent-0f7hwj via gitlab-runner-75b464d954-b76db...
Cloning repository...
...
...
ERROR: Job failed (system failure): Get https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused
gitlab-runner-75b464d954-b76db gitlab-runner 2018-05-25T09:14:33.566800868Z ERROR: Job failed (system failure): Get https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused job=159923 project=468 runner=7974b466
gitlab-runner-75b464d954-b76db gitlab-runner 2018-05-25T09:14:34.907593939Z ERROR: Error cleaning up pod: Delete https://<Kubernetes Master IP>:443/api/v1/namespaces/gitlab-runner-jobs/pods/runner-7974b466-project-468-concurrent-0f7hwj: dial tcp <Kubernetes Master IP>:443: getsockopt: connection refused job=159923 project=468 runner=7974b466
Environment description
Runners are on GKE Kubernetes 1.8.10
Used GitLab Runner version
Running with gitlab-runner 10.8.0 (079aad9e)
Kubernetes executor.