Kubernetes executor: ERROR: Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out.
Summary
Occasionally a job fails with this error while the executor is making requests to the Kubernetes API. I would expect an intermittent failure such as a timeout to trigger a backoff and retry, rather than fail the job immediately.
We do see a non-zero error rate at the kube-apiserver, but would expect gitlab-runner to recover from such errors with a short backoff and retry.
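To illustrate the kind of behavior I mean, here is a minimal sketch of retrying a transient API error with capped exponential backoff and jitter. It is not how gitlab-runner is implemented; `TransientAPIError`, `retry_with_backoff`, and all parameter values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable API error such as an etcd timeout."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Only TransientAPIError is retried; any other exception propagates at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            # Exponential growth capped at max_delay, scaled by random jitter.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * rng())


if __name__ == "__main__":
    calls = {"n": 0}

    def create_pod():
        # Simulated API call: times out twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientAPIError("etcdserver: request timed out")
        return "pod-created"

    print(retry_with_backoff(create_pod, sleep=lambda seconds: None))
```

The jitter keeps many jobs from one large pipeline from retrying against the API server at the same instant, which matters when hundreds of jobs start together.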
Steps to reproduce
The issue is intermittent, so it is difficult to reproduce. Our pipelines are fairly large, with 650 jobs in a single pipeline. The error is not specific to any job or runner.
Actual behavior
Job fails
Expected behavior
Job waits for poll_timeout seconds before failing the job.
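As a rough sketch of that expectation, the loop below retries a failing call at a fixed interval until a total time budget is used up, and only then gives up. The defaults mirror the poll_timeout = 900 and poll_interval = 20 values from our config, but `retry_until_deadline` and `TransientAPIError` are hypothetical names for illustration, not gitlab-runner internals.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an etcd request timeout)."""


def retry_until_deadline(operation, timeout_s=900, interval_s=20,
                         clock=time.monotonic, sleep=time.sleep):
    """Retry operation() every interval_s seconds until timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()
        except TransientAPIError:
            # Give up only when the next attempt would start past the deadline.
            if clock() + interval_s > deadline:
                raise
            sleep(interval_s)
```

With these defaults a single etcd timeout would cost one 20-second wait instead of failing the whole job.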
Relevant logs and/or screenshots
runner log
{
"duration_s": 13.085811169,
"job": 18290279,
"level": "error",
"msg": "Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information",
"project": 3928,
"runner": "rHxGzYhb",
"time": "2023-03-07T19:49:35Z"
}
job log
Running with gitlab-runner 15.1.0~beta.1.gb55b1e56 (b55b1e56)
on vt-firmware-ram-gitlab-runner-64c56ccf95-kr2rq rHxGzYhb
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
"CPURequest" overwritten with "2"
"MemoryRequest" overwritten with "4Gi"
"CPULimit" overwritten with "2"
"MemoryLimit" overwritten with "4Gi"
"HelperCPURequest" overwritten with "500m"
"HelperMemoryRequest" overwritten with "10Gi"
"HelperCPULimit" overwritten with "2"
"HelperMemoryLimit" overwritten with "10Gi"
Using Kubernetes namespace: vt-firmware-ram
Using Kubernetes executor with image xxx ...
Using attach strategy to execute scripts...
Preparing environment
00:07
ERROR: Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Environment description
Running self-managed GitLab.
config.toml contents
config: |
  [[runners]]
    environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
    [runners.kubernetes]
      image = "ubuntu:20.04"
      poll_timeout = 900
      poll_interval = 20
      pull_policy = ["always", "always", "always", "if-not-present"]
      resource_availability_check_max_attempts = 0 # Disable the check altogether, fallback to default polling on pod readiness
      [runners.kubernetes.pod_annotations]
        "karpenter.sh/do-not-evict" = "true"
        "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
      [runners.kubernetes.node_selector]
        "vt.goriv.co/runners" = "true"
        "kubernetes.io/arch" = "{{ .Values.runner_architecture | default "amd64" }}"
        "kubernetes.io/os" = "{{ .Values.runner_os | default "linux" }}"
Used GitLab Runner version
15.1