Jobs with Kubernetes executor fail with `Job failed (system failure): prepare environment: error dialing backend: dial timeout`

Status update (2022-11-23)

A fix for this bug was merged in Runner 15.6.
The MR expands the retry logic to cover any error of the "internal server error" type whose message begins with `error dialing backend`.
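
The retry condition described in this status update might be sketched roughly as follows. This is only an illustration in Python, not the Runner's actual Go code: `StatusError`, `is_retryable`, and `run_with_retry` are hypothetical names, and the attempt count and delay are assumed values rather than the ones the Runner uses.

```python
import time


class StatusError(Exception):
    """Hypothetical stand-in for a Kubernetes API status error."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code        # HTTP-style status code, e.g. 500
        self.message = message  # error text returned by the API server


def is_retryable(err):
    """True for internal server errors (500) whose message starts with
    'error dialing backend', whatever follows (dial timeout, EOF, ...)."""
    return (
        isinstance(err, StatusError)
        and err.code == 500
        and err.message.startswith("error dialing backend")
    )


def run_with_retry(operation, attempts=3, delay=0.0):
    """Call operation(), retrying only retryable errors.

    attempts and delay are assumed values for this sketch.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except StatusError as err:
            # Give up on the last attempt or on any non-retryable error.
            if attempt == attempts or not is_retryable(err):
                raise
            time.sleep(delay)
```

Matching on the message prefix rather than the full string means variants such as `error dialing backend: dial timeout` are all treated the same way, while unrelated 500 errors still fail immediately.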

Summary

The GitLab Runner Kubernetes executor fails with `Job failed (system failure): prepare environment: error dialing backend: dial timeout`.

Jobs occasionally succeed, but only rarely.

Steps to reproduce

I am running jobs on GKE version 1.19.9-gke.1900 (no autoscaling, no preemptible nodes) with Runner version v13.12.0.

I am using version 0.29.0 of the Helm chart.

.gitlab-ci.yml

stages:
  - validate
  - plan
  - apply
  - deploy

variables:
  GOOGLE_APPLICATION_CREDENTIALS: "/sa.json"

before_script:
  - apk update
  - 'which ssh-agent || ( apk add openssh-client )'
  - eval "$(ssh-agent -s)"
  - echo "$SSH_PRIVATE_KEY" | ssh-add -
  - mkdir -p ~/.ssh
  - echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config

  - echo "$GCP_SA" > /sa.json

image: alpine/terragrunt:0.14.4

include:
  - local: ml/job.yml

ml/job.yml

ml_validate:
  stage: validate
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt validate-all --terragrunt-parallelism 1

ml_plan:
  stage: plan
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt plan-all --terragrunt-parallelism 1
  dependencies:
    - ml_validate

ml_apply:
  stage: apply
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt apply-all --terragrunt-parallelism 1 --terragrunt-non-interactive
  dependencies:
    - ml_plan
  when: manual

Actual behavior

The job fails with one of two errors:

ERROR: Job failed (system failure): error dialing backend: dial timeout

ERROR: Job failed (system failure): prepare environment: error dialing backend: dial timeout. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Expected behavior

The job should run to completion without a system failure.

Relevant logs and/or screenshots

job log

Running with gitlab-runner 13.12.0 (7a6612da)
  on gitlab-runner-gitlab-runner-767d878495-p6wcw zmEZGGa6
  feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true
Resolving secrets 00:00
Preparing the "kubernetes" executor 00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image alpine/terragrunt:0.14.4 ...
Preparing environment
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-zmezgga6-project-1238-concurrent-0sh8wm via gitlab-runner-gitlab-runner-767d878495-p6wcw...
Getting source from Git repository 00:00
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/zmEZGGa6/0/devops/infra/.git/
Created fresh repository.
Checking out 205f73af as master...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Cleaning up file based variables 00:00
ERROR: Job failed (system failure): error dialing backend: dial timeout

Environment description

I am running a self-hosted GitLab instance, version 13.10.3-ee (db2e358dba4).

values.yaml contents

gitlab-runner:
  checkInterval: 2
  concurrent: 30
  gitlabUrl: ... # redacted
  logFormat: json
  rbac:
    create: true
  nodeSelector:
    gitlab-runner: "true"
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  tolerations:
    - key: "gitlab-runner"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
  runners:
    secret: gitlab-gitlab-runner-secret
    config: |
      [[runners]]
        url = "..." # redacted
        executor = "kubernetes"
        environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1", "FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=1", "DOCKER_DRIVER=overlay2", "DOCKER_HOST=tcp://localhost:2375","DOCKER_TLS_CERTDIR=/certs", "DOCKER_TLS_VERIFY=1","DOCKER_CERT_PATH=/certs/client", "GOPROXY=http://athens-athens-proxy.athens,direct"]
        [runners.kubernetes]
          privileged = true
          namespace = "gitlab"
          poll_interval = 5
          pod_annotations = ["cluster-autoscaler.kubernetes.io/safe-to-evict=false"]
          [runners.kubernetes.node_selector]
            "gitlab-runner" = "true"
          [runners.kubernetes.node_tolerations]
            "gitlab-runner=true" = "NoSchedule"
          [runners.cache]
            Path = "runner"
            Shared = true
            Type = "gcs"
            [runners.cache.gcs]
              BucketName = "..." # redacted
        [[runners.kubernetes.volumes.empty_dir]]
          name = "docker-certs"
          mount_path = "/certs/client"
          medium = "Memory"
    cache:
      secretName: google-application-credentials

Used GitLab Runner version

Running with gitlab-runner 13.12.0 (7a6612da)
  on gitlab-runner-gitlab-runner-767d878495-p6wcw zmEZGGa6
  feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true

Possible fixes

Edited Nov 23, 2022 by Darren Eastman