Kubernetes executor: ERROR: Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out.

Problem

GitLab Runner's Kubernetes executor fails intermittently when its requests to the Kubernetes API server fail or time out.

  • No retry logic is implemented, so a job fails on the first failed API request
  • This makes GitLab CI/CD pipelines brittle: a temporary infrastructure hiccup is enough to fail a job

Summary - original issue submission

On occasion, we see this intermittent failure when the executor makes requests to the Kubernetes API. I would expect any intermittent failure, such as a timeout, to trigger a backoff and retry rather than fail the job immediately.

We can see a non-zero error rate at the kube-apiserver, but would expect gitlab-runner to recover from this with a short backoff and retry.
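
To make the request concrete, here is a minimal sketch of the backoff/retry behavior described above. It is written in Python for illustration only and is not GitLab Runner code; `TransientAPIError`, `retry_with_backoff`, and all parameter defaults are hypothetical names and values chosen for this example.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable error such as
    'etcdserver: request timed out'."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff.

    The delay doubles after each failed attempt (capped at max_delay),
    plus random jitter so many jobs do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the job
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

With these assumed defaults, a request that times out twice and then succeeds would add roughly 1.5 to 2.25 seconds to the job instead of failing it.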

(Screenshot: screencapture-grafana-kube-apiserver)

Steps to reproduce

The issue is intermittent, so it is difficult to reproduce. Our pipelines are fairly large, with 650 jobs in a single pipeline. The error is not specific to any job or runner.

Actual behavior

The job fails immediately when a request to the Kubernetes API fails.

Expected behavior

The job keeps retrying for up to poll_timeout seconds before it fails.
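
The expected behavior can be sketched as a deadline-bounded retry loop. This is an illustration of what this issue asks for, not a description of how GitLab Runner uses `poll_timeout` or `poll_interval` today; the function and exception names are hypothetical, and the defaults simply mirror the values in the config.toml below.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable Kubernetes API error."""


def call_until_deadline(call, poll_timeout=900, poll_interval=20,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry call() every poll_interval seconds until it succeeds
    or poll_timeout seconds have elapsed since the first attempt."""
    deadline = clock() + poll_timeout
    while True:
        try:
            return call()
        except TransientAPIError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # deadline reached: only now fail the job
            # Never sleep past the deadline.
            sleep(min(poll_interval, remaining))
```

Bounding the loop by elapsed time rather than by attempt count keeps the worst-case wait predictable regardless of how long each failed request takes.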

Relevant logs and/or screenshots

runner log
{
  "duration_s": 13.085811169,
  "job": 18290279,
  "level": "error",
  "msg": "Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information",
  "project": 3928,
  "runner": "rHxGzYhb",
  "time": "2023-03-07T19:49:35Z"
}
job log
Running with gitlab-runner 15.1.0~beta.1.gb55b1e56 (b55b1e56)
  on vt-firmware-ram-gitlab-runner-64c56ccf95-kr2rq rHxGzYhb
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
"CPURequest" overwritten with "2"
"MemoryRequest" overwritten with "4Gi"
"CPULimit" overwritten with "2"
"MemoryLimit" overwritten with "4Gi"
"HelperCPURequest" overwritten with "500m"
"HelperMemoryRequest" overwritten with "10Gi"
"HelperCPULimit" overwritten with "2"
"HelperMemoryLimit" overwritten with "10Gi"
Using Kubernetes namespace: vt-firmware-ram
Using Kubernetes executor with image xxx ...
Using attach strategy to execute scripts...
Preparing environment
00:07
ERROR: Job failed (system failure): prepare environment: setting up build pod: etcdserver: request timed out. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Environment description

Running self-managed GitLab.

config.toml contents
  config: |
    [[runners]]
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
        [runners.kubernetes]
          image = "ubuntu:20.04"
          poll_timeout = 900
          poll_interval = 20
          pull_policy = ["always", "always", "always", "if-not-present"]
          resource_availability_check_max_attempts = 0  # Disable the check altogether, fallback to default polling on pod readiness
        [runners.kubernetes.pod_annotations]
          "karpenter.sh/do-not-evict" = "true"
          "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
        [runners.kubernetes.node_selector]
          "vt.goriv.co/runners" = "true"
          "kubernetes.io/arch" = "{{ .Values.runner_architecture | default "amd64" }}"
          "kubernetes.io/os" = "{{ .Values.runner_os | default "linux" }}"

Used GitLab Runner version

15.1

Edited by Darren Eastman