Resilient jobs on GKE autoscaling/preemptible node pools with the GitLab Runner Kubernetes executor

Summary

We are still seeing issue #3247 (closed) on runner 12.10.1.

With FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=0, instead of the old error dialing backend: EOF error we now see some retries, but jobs still eventually fail, and a significant percentage of our jobs are affected.
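For reference, one way to flip the flag for all jobs on a runner is via the standard environment option in config.toml; the variable can equally be set per job as a CI/CD variable. A minimal sketch:

[[runners]]
  # Disable the legacy exec-based execution strategy so the
  # attach-based strategy introduced in 12.9 is used instead.
  environment = ["FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=0"]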

We have been using an image built from MR !1664 (closed) as our gitlab-runner image for three months now, and it appears to solve this issue.

Steps to reproduce

Run jobs on an autoscaling, preemptible node pool. Some jobs will eventually fail due to lost connections to the job pods, even with the retries introduced in 12.9 and 12.10.
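For a minimal reproduction environment, a pool like ours can be created along these lines (cluster, zone, and node counts here are placeholders):

gcloud container node-pools create gitlab-builds-pool \
  --cluster=my-cluster --zone=europe-west1-b \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=10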

Relevant logs and/or screenshots

Job log

For example:

WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper?follow=true&sinceTime=2020-04-28T07%3A15%3A01Z&timestamps=true: ssh: rejected: connect failed (Connection refused). Retrying...
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build?follow=true&sinceTime=2020-04-28T07%3A16%3A07Z&timestamps=true: EOF. Retrying...
[...more retries redacted...]
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pod "runner-5sz1vu3k-project-3-concurrent-15g9bf6" status is "Pending"

and

WARNING: Error reading log for dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: bufio.Scanner: token too long. Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: Get https://10.132.0.46:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build?follow=true&sinceTime=2020-04-28T05%3A36%3A33Z&timestamps=true: ssh: rejected: connect failed (Connection timed out). Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/helper: pods "gke-uc-cluster-gitlab-builds-pool-636afef3-gvwq" not found. Retrying...
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pods "runner-5sz1vu3k-project-3-concurrent-138lgt7" not found

Environment description

GitLab and gitlab-runner are installed via Helm charts on non-preemptible GCP nodes. The runner jobs themselves run on an autoscaling pool of preemptible nodes.

config.toml contents
listen_address = ":9252"
concurrent = 999
check_interval = 1
log_level = "info"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "runner-gitlab-runner-5cd8859c8d-9p69d"
  output_limit = 4096
  request_concurrency = 1
  url = "[redacted]"
  token = "[redacted]"
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.cache]
    Type = "gcs"
    Shared = true
    [runners.cache.s3]
    [runners.cache.gcs]
      AccessID = "[redacted]"
      PrivateKey = "[redacted]"
      BucketName = "...-runner-cache"
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "ubuntu:16.04"
    namespace = "dev"
    namespace_overwrite_allowed = ""
    privileged = true
    cpu_request = "800m"
    service_cpu_request = "400m"
    poll_timeout = 3600
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
      "cloud.google.com/gke-nodepool" = "gitlab-builds-pool"
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.volumes]
Helm chart values

gitlabUrl: [redacted]
runnerRegistrationToken: [redacted]

concurrent: 999
# image: .../gitlab-runner-1664-nodockerfile:1.0.0
checkInterval: 1

runners:
  privileged: true
  namespace: dev
  pollTimeout: 3600
  nodeSelector:
    cloud.google.com/gke-nodepool: gitlab-builds-pool
  cache:
    cacheShared: true
    cacheType: gcs
    gcsBucketName: [redacted]
    secretName: [redacted]
  builds:
    cpuRequests: 800m
  services:
    cpuRequests: 400m

nodeSelector:
  cloud.google.com/gke-nodepool: default-pool

Used GitLab Runner version

Version:      12.10.1
Git revision: ce065b93
Git branch:   12-10-stable
GO version:   go1.13.8
Built:        2020-04-22T21:29:52+0000
OS/Arch:      linux/amd64

Helm chart 0.16.0.

Related issues

#3247 (closed)

#4119 (closed)

Possible fixes

We have been running Chet Lemon's patch from !1664 (closed) on top of 11.10 for some months now with success (as seen in #4119 (comment 243809158)); at least from our point of view, the use of Kubernetes Jobs seems to solve this issue. We also wonder whether marking nodes that are currently running jobs as not safe for scale-down, and removing that mark when the build completes, would be a viable solution; a rough sketch of that idea follows.
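One approximation of that idea, assuming the pod_annotations setting of the Kubernetes executor: annotate build pods as not safe to evict, so the cluster autoscaler refuses to scale down the nodes hosting them (note this guards against autoscaler scale-down only, not against GCE preempting the VM itself):

  [runners.kubernetes]
    # Cluster-autoscaler honours this pod annotation and will not
    # remove a node while a pod carrying it is running there.
    [runners.kubernetes.pod_annotations]
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"

Since the annotation lives on the job pod, it disappears when the pod is deleted at job completion, which matches the "remove the mark when the build completes" behaviour described above.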

Thanks in advance.
