Error dialing backend: connection refused (AWS EKS)
Runner jobs frequently fail with an error like this:
ERROR: Job failed (system failure): error dialing backend: dial tcp 10.252.177.181:10250: connect: connection refused
We've been able to correlate this error with node scale-down events triggered by the Cluster Autoscaler. To avoid pod eviction, we tried setting the cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation on both the runner and builder pods, but that didn't help. We've also seen the failure occur in cases where we verified that the failing GitLab pod stayed on the same node throughout the scale-down event.
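For reference, here is a minimal sketch of how an annotation like this might be set through the Helm chart values. It is not necessarily how we applied it. It assumes the chart version in use exposes a top-level podAnnotations key for the runner manager pod and a runners.podAnnotations key for the job (builder) pods; check both key names against the values.yaml of your chart version.

```yaml
# Hypothetical values.yaml fragment for the gitlab-runner Helm chart.
# Assumption: `podAnnotations` applies to the runner manager pod and
# `runners.podAnnotations` applies to the job (builder) pods.
podAnnotations:
  # Annotation values must be strings, so "false" is quoted.
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

runners:
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

Note that this annotation only asks the Cluster Autoscaler not to evict the annotated pods. As described above, it did not prevent the failures in our case.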
We logged an issue with AWS. After some debugging with them, they said they believe the problem most likely lies with either the GitLab Runner or the node autoscaler, not with EKS in general.
I've commented on Issue 3247, since the error seems almost identical.
Steps to reproduce
Enable Kubernetes runners on EKS along with the Cluster Autoscaler. Various jobs will then frequently fail with the error message shown above.
Actual behavior
The GitLab CI jobs frequently fail with the aforementioned error.
Expected behavior
The jobs should not fail.
Relevant logs and/or screenshots
... ERROR: Job failed (system failure): error dialing backend: dial tcp <IP>:<PORT>: connect: connection refused
chart:
  repository: https://charts.gitlab.io/
  name: gitlab-runner
  version: 0.13.1
Used GitLab Runner version
Running with gitlab-runner 12.7.1 (003fe500)
  on gitlab-runner-gitlab-runner-7c89576bb6-tmdl7 yFRhHxSU
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image 696398453447.dkr.ecr.us-west-2.amazonaws.com/gitlab-runner:1.113.0 ...
...