K8s runner jobs on GKE autoscaling/preemptible node pools: EOF retries
We are still seeing issue #3247 (closed) on runner 12.10.1.
With FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=0, instead of the old error dialing backend: EOF error we now see some retries, but the connection is still eventually lost, causing a significant percentage of our jobs to fail.
We have been using an image built from MR !1664 (closed) as our gitlab-runner image for 3 months now, and it appears to solve this issue.
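For context, here is a minimal sketch of how we understand the flag can be toggled from config.toml (setting it as an environment variable on the runner process should work as well); everything except the feature-flag line is a placeholder, not our real configuration:

```toml
# Sketch only: all values besides the feature-flag line are placeholders.
[[runners]]
  executor = "kubernetes"
  # Disable the legacy execution strategy so the newer attach strategy
  # (with the log re-attach retries introduced in 12.9/12.10) is used.
  environment = ["FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false"]
```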
Steps to reproduce
Run jobs on an autoscaling, preemptible node pool. Some jobs will eventually fail due to lost connections to the job pods, even with the retries introduced in 12.9 and 12.10.
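For completeness, a node pool similar to ours can be created with something like the following; the cluster name, zone, machine type and size limits below are placeholders, not our actual values:

```shell
# Sketch: create an autoscaling, preemptible GKE node pool for the job pods.
gcloud container node-pools create gitlab-builds-pool \
  --cluster=my-cluster \
  --zone=europe-west1-b \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=10 \
  --machine-type=n1-standard-4
```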
Relevant logs and/or screenshots
```
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper?follow=true&sinceTime=2020-04-28T07%3A15%3A01Z&timestamps=true: ssh: rejected: connect failed (Connection refused). Retrying...
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build?follow=true&sinceTime=2020-04-28T07%3A16%3A07Z&timestamps=true: EOF. Retrying...
[...more retries redacted...]
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pod "runner-5sz1vu3k-project-3-concurrent-15g9bf6" status is "Pending"
```
```
WARNING: Error reading log for dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: bufio.Scanner: token too long. Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: Get https://10.132.0.46:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build?follow=true&sinceTime=2020-04-28T05%3A36%3A33Z&timestamps=true: ssh: rejected: connect failed (Connection timed out). Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/helper: pods "gke-uc-cluster-gitlab-builds-pool-636afef3-gvwq" not found. Retrying...
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pods "runner-5sz1vu3k-project-3-concurrent-138lgt7" not found
```
GitLab and gitlab-runner are installed via Helm charts on non-preemptible GCP nodes. The runner jobs themselves run on a pool of autoscaling, preemptible nodes.
listen_address = ":9252" concurrent = 999 check_interval = 1 log_level = "info" [session_server] session_timeout = 1800 [[runners]] name = "runner-gitlab-runner-5cd8859c8d-9p69d" output_limit = 4096 request_concurrency = 1 url = "[redacted]" token = "[redacted]" executor = "kubernetes" [runners.custom_build_dir] [runners.cache] Type = "gcs" Shared = true [runners.cache.s3] [runners.cache.gcs] AccessID = "[redacted]" PrivateKey = "[redacted]" BucketName = "...-runner-cache" [runners.kubernetes] host = "" bearer_token_overwrite_allowed = false image = "ubuntu:16.04" namespace = "dev" namespace_overwrite_allowed = "" privileged = true cpu_request = "800m" service_cpu_request = "400m" poll_timeout = 3600 service_account_overwrite_allowed = "" pod_annotations_overwrite_allowed = "" [runners.kubernetes.node_selector] "cloud.google.com/gke-nodepool" = "gitlab-builds-pool" [runners.kubernetes.pod_security_context] [runners.kubernetes.volumes]
Helm chart values
```yaml
gitlabUrl: [redacted]
runnerRegistrationToken: [redacted]
concurrent: 999
# image: .../gitlab-runner-1664-nodockerfile:1.0.0
checkInterval: 1
runners:
  privileged: true
  namespace: dev
  pollTimeout: 3600
  nodeSelector:
    cloud.google.com/gke-nodepool: gitlab-builds-pool
  cache:
    cacheShared: true
    cacheType: gcs
    gcsBucketName: [redacted]
    secretName: [redacted]
  builds:
    cpuRequests: 800m
  services:
    cpuRequests: 400m
nodeSelector:
  cloud.google.com/gke-nodepool: default-pool
```
Used GitLab Runner version
```
Version:      12.10.1
Git revision: ce065b93
Git branch:   12-10-stable
GO version:   go1.13.8
Built:        2020-04-22T21:29:52+0000
OS/Arch:      linux/amd64
```
Helm chart 0.16.0.
We have been using this patch by Chet Lemon (!1664 (closed)) on top of 11.10 for some months now with success (as seen in #4119 (comment 243809158)); at least from our point of view, the use of Kubernetes Jobs seems to solve this issue. We also wonder whether marking nodes that are currently running jobs as not safe for scale-down, and removing that mark when the build completes, would be a viable solution.
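As a rough illustration of that idea, the cluster autoscaler can be told not to scale down a node while a job pod is running on it by annotating the job pods; a sketch of how that could look in config.toml is below. Note this only influences autoscaler scale-down, not GCE preempting the VM itself, and the annotation value is our assumption of how we would try it rather than something from our current setup.

```toml
[runners.kubernetes]
  # ... existing kubernetes executor settings ...
  [runners.kubernetes.pod_annotations]
    # Ask the cluster autoscaler not to evict this pod (and therefore not to
    # scale down its node) while the job is running. Does not prevent the
    # preemptible VM itself from being preempted by GCE.
    "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
```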
Thanks in advance.