Kubernetes executor isn't resilient to transient API failures
Summary
When the Kubernetes cluster has a transient issue (e.g. the etcd cluster is in the middle of a leader election because of instability or a rolling deployment), the executor gives up on the first failed API call and fails the job, even though retrying the call would have succeeded.
Steps to reproduce
Run the Kubernetes executor, trigger a large number of jobs, then trigger an etcd leader election while they are running. Some of the jobs fail.
Actual behavior
Job fails
Expected behavior
Executor retries the API call, job continues
Relevant logs and/or screenshots
runner:
Checking for jobs... received job=2589916 repo_url=https://gitlab.tech.lastmile.com/foo/bar.git runner=3f8cc012
ERROR: Job failed (system failure): client: etcd member http://127.0.0.1:2379 has no leader job=2589916 project=3106 runner=3f8cc012
build:
Running with gitlab-runner 10.6.0~beta.377.gbe12a386 (be12a386)
on ospcfc-kubernetes-devtools-500mcpu-1GiB (3f8cc012)
Using Kubernetes namespace: gitlabrunnerospcfc
Using Kubernetes executor with image mirror-internal.docker.tech.lastmile.com/kubernetes/kubernetes-manifest-deploy:5.2.0@sha256:4aead3fd52b75c3c3d0f01be31d9058b1cd641a3bb1a761ff1d555babf18681c ...
ERROR: Job failed (system failure): client: etcd member http://127.0.0.1:2379 has no leader
Environment description
Private installation. GitLab 10.3
Used GitLab Runner version
Private installation. Running off master branch + !652 (closed)