Jobs with Kubernetes executor fail with `Job failed (system failure): prepare environment: error dialing backend: dial timeout`
Status update (2022-11-23)
- A fix for this bug was merged in Runner 15.6.
- The MR expands the retry logic to include any errors that begin with `error dialing backend` and are of the "internal server error" type.
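The retry condition described in the status update can be sketched roughly as follows. This is a Python illustration only, not the Runner's actual Go implementation: the `ApiError` class, the `is_retryable` and `with_retries` helpers, and the attempt count and delay are hypothetical names and values introduced here.

```python
import time


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes API error (not a real Runner type)."""

    def __init__(self, status_code, message):
        super().__init__(message)
        self.status_code = status_code
        self.message = message


RETRYABLE_PREFIX = "error dialing backend"


def is_retryable(err):
    """Retry only HTTP 500 (internal server error) whose message begins with the prefix."""
    return (
        isinstance(err, ApiError)
        and err.status_code == 500
        and err.message.startswith(RETRYABLE_PREFIX)
    )


def with_retries(operation, attempts=3, delay=0.0):
    """Call operation(), retrying retryable errors until `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            # Give up on the last attempt or on any error outside the retry rule.
            if attempt == attempts or not is_retryable(err):
                raise
            time.sleep(delay)
```

Under this rule, a `dial timeout` that surfaces as an internal server error is retried, while an unrelated error (for example, a 404) still fails the job immediately.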
Summary
The GitLab Runner Kubernetes executor fails with `Job failed (system failure): prepare environment: error dialing backend: dial timeout`.
Jobs occasionally succeed, but only rarely.
Steps to reproduce
I am running jobs on GKE version 1.19.9-gke.1900, with no autoscaling and no preemptible nodes, using runner version v13.12.0.
I am using the Helm chart, version 0.29.0.
.gitlab-ci.yml
stages:
  - validate
  - plan
  - apply
  - deploy
variables:
  GOOGLE_APPLICATION_CREDENTIALS: "/sa.json"
before_script:
  - apk update
  - 'which ssh-agent || ( apk add openssh-client )'
  - eval "$(ssh-agent -s)"
  - echo "$SSH_PRIVATE_KEY" | ssh-add -
  - mkdir -p ~/.ssh
  - echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
  - echo "$GCP_SA" > /sa.json
image: alpine/terragrunt:0.14.4
include:
  - local: ml/job.yml
ml/job.yml
ml_validate:
  stage: validate
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt validate-all --terragrunt-parallelism 1
ml_plan:
  stage: plan
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt plan-all --terragrunt-parallelism 1
  dependencies:
    - ml_validate
ml_apply:
  stage: apply
  only:
    changes:
      - ml/**/*
  script:
    - cd ml
    - terragrunt apply-all --terragrunt-parallelism 1 --terragrunt-non-interactive
  dependencies:
    - ml_plan
  when: manual
Actual behavior
The job fails with one of two errors:
ERROR: Job failed (system failure): error dialing backend: dial timeout
or
ERROR: Job failed (system failure): prepare environment: error dialing backend: dial timeout. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
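To tell which of the two variants a given job hit, the error line of a job log can be matched against both forms. A minimal sketch, assuming the log is available as plain text; the `classify_failure` function and its return labels are hypothetical names introduced for illustration.

```python
import re

# Matches both variants reported above; the "prepare environment: " part is optional.
FAILURE_RE = re.compile(
    r"ERROR: Job failed \(system failure\): "
    r"(?P<stage>prepare environment: )?error dialing backend: dial timeout"
)


def classify_failure(log_text):
    """Return 'prepare_environment', 'no_stage_prefix', or None if no match is found."""
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            return "prepare_environment" if match.group("stage") else "no_stage_prefix"
    return None
```

Run against the job log in this report, the function returns `'no_stage_prefix'`, since that log ends with the variant that lacks the `prepare environment:` prefix.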
Expected behavior
The job should run to completion without a system failure.
Relevant logs and/or screenshots
job log
Running with gitlab-runner 13.12.0 (7a6612da)
on gitlab-runner-gitlab-runner-767d878495-p6wcw zmEZGGa6
feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true
Resolving secrets 00:00
Preparing the "kubernetes" executor 00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image alpine/terragrunt:0.14.4 ...
Preparing environment
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab/runner-zmezgga6-project-1238-concurrent-0sh8wm to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-zmezgga6-project-1238-concurrent-0sh8wm via gitlab-runner-gitlab-runner-767d878495-p6wcw...
Getting source from Git repository 00:00
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/zmEZGGa6/0/devops/infra/.git/
Created fresh repository.
Checking out 205f73af as master...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Cleaning up file based variables 00:00
ERROR: Job failed (system failure): error dialing backend: dial timeout
Environment description
I am running a self-hosted GitLab instance, version 13.10.3-ee (db2e358dba4).
values.yaml contents
gitlab-runner:
  checkInterval: 2
  concurrent: 30
  gitlabUrl: ... # redacted
  logFormat: json
  rbac:
    create: true
  nodeSelector:
    gitlab-runner: "true"
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  tolerations:
    - key: "gitlab-runner"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
  runners:
    secret: gitlab-gitlab-runner-secret
    config: |
      [[runners]]
        url = "..." # redacted
        executor = "kubernetes"
        environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1", "FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=1", "DOCKER_DRIVER=overlay2", "DOCKER_HOST=tcp://localhost:2375","DOCKER_TLS_CERTDIR=/certs", "DOCKER_TLS_VERIFY=1","DOCKER_CERT_PATH=/certs/client", "GOPROXY=http://athens-athens-proxy.athens,direct"]
        [runners.kubernetes]
          privileged = true
          namespace = "gitlab"
          poll_interval = 5
          pod_annotations = ["cluster-autoscaler.kubernetes.io/safe-to-evict=false"]
          [runners.kubernetes.node_selector]
            "gitlab-runner" = "true"
          [runners.kubernetes.node_tolerations]
            "gitlab-runner=true" = "NoSchedule"
        [runners.cache]
          Path = "runner"
          Shared = true
          Type = "gcs"
          [runners.cache.gcs]
            BucketName = "..." # redacted
        [[runners.kubernetes.volumes.empty_dir]]
          name = "docker-certs"
          mount_path = "/certs/client"
          medium = "Memory"
    cache:
      secretName: google-application-credentials
Used GitLab Runner version
Running with gitlab-runner 13.12.0 (7a6612da)
on gitlab-runner-gitlab-runner-767d878495-p6wcw zmEZGGa6
feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true
Possible fixes