K8s runner jobs on GKE autoscaling/preemptible node pools: EOF retries
We are still seeing issue #3247 (closed) on runner 12.10.1.
With FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=0, instead of the old error dialing backend: EOF error we now see some retries, but the connection is still eventually lost, causing a significant percentage of our jobs to fail.
We have been using an image built from MR !1664 (closed) as our gitlab-runner image for 3 months now, and it appears to solve this issue.
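For context, here is a minimal sketch of how we understand the flag can be toggled from config.toml (setting it as an environment variable on the runner process should work as well); everything except the feature-flag line is a placeholder, not our real configuration:

```toml
# Sketch only: all values besides the feature-flag line are placeholders.
[[runners]]
  executor = "kubernetes"
  # Disable the legacy execution strategy so the newer attach strategy
  # (with the log re-attach retries introduced in 12.9/12.10) is used.
  environment = ["FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false"]
```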
Steps to reproduce
Run jobs on an autoscaling, preemptible node pool. Some jobs will eventually fail due to lost connections to the job pods, even with the retries introduced in 12.9 and 12.10.
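For completeness, a node pool similar to ours can be created with something like the following; the cluster name, zone, machine type and size limits below are placeholders, not our actual values:

```shell
# Sketch: create an autoscaling, preemptible GKE node pool for the job pods.
gcloud container node-pools create gitlab-builds-pool \
  --cluster=my-cluster \
  --zone=europe-west1-b \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=10 \
  --machine-type=n1-standard-4
```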
Relevant logs and/or screenshots
```
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/helper?follow=true&sinceTime=2020-04-28T07%3A15%3A01Z&timestamps=true: ssh: rejected: connect failed (Connection refused). Retrying...
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build: Get https://10.132.0.127:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-33pksct/build?follow=true&sinceTime=2020-04-28T07%3A16%3A07Z&timestamps=true: EOF. Retrying...
[...more retries redacted...]
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pod "runner-5sz1vu3k-project-3-concurrent-15g9bf6" status is "Pending"
```
```
WARNING: Error reading log for dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: bufio.Scanner: token too long. Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build: Get https://10.132.0.46:10250/containerLogs/dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/build?follow=true&sinceTime=2020-04-28T05%3A36%3A33Z&timestamps=true: ssh: rejected: connect failed (Connection timed out). Retrying...
[...]
WARNING: Error attaching to log dev/runner-5sz1vu3k-project-3-concurrent-138lgt7/helper: pods "gke-uc-cluster-gitlab-builds-pool-636afef3-gvwq" not found. Retrying...
Running after_script
Uploading artifacts for failed job
ERROR: Job failed: pods "runner-5sz1vu3k-project-3-concurrent-138lgt7" not found
```
GitLab and gitlab-runner are installed via Helm charts on non-preemptible GCP nodes. The runner jobs themselves run on a pool of autoscaling, preemptible nodes.
listen_address = ":9252" concurrent = 999 check_interval = 1 log_level = "info" [session_server] session_timeout = 1800 [[runners]] name = "runner-gitlab-runner-5cd8859c8d-9p69d" output_limit = 4096 request_concurrency = 1 url = "[redacted]" token = "[redacted]" executor = "kubernetes" [runners.custom_build_dir] [runners.cache] Type = "gcs" Shared = true [runners.cache.s3] [runners.cache.gcs] AccessID = "[redacted]" PrivateKey = "[redacted]" BucketName = "...-runner-cache" [runners.kubernetes] host = "" bearer_token_overwrite_allowed = false image = "ubuntu:16.04" namespace = "dev" namespace_overwrite_allowed = "" privileged = true cpu_request = "800m" service_cpu_request = "400m" poll_timeout = 3600 service_account_overwrite_allowed = "" pod_annotations_overwrite_allowed = "" [runners.kubernetes.node_selector] "cloud.google.com/gke-nodepool" = "gitlab-builds-pool" [runners.kubernetes.pod_security_context] [runners.kubernetes.volumes]
Helm chart values
```yaml
gitlabUrl: [redacted]
runnerRegistrationToken: [redacted]
concurrent: 999
# image: .../gitlab-runner-1664-nodockerfile:1.0.0
checkInterval: 1
runners:
  privileged: true
  namespace: dev
  pollTimeout: 3600
  nodeSelector:
    cloud.google.com/gke-nodepool: gitlab-builds-pool
  cache:
    cacheShared: true
    cacheType: gcs
    gcsBucketName: [redacted]
    secretName: [redacted]
  builds:
    cpuRequests: 800m
  services:
    cpuRequests: 400m
nodeSelector:
  cloud.google.com/gke-nodepool: default-pool
```
Used GitLab Runner version
```
Version:      12.10.1
Git revision: ce065b93
Git branch:   12-10-stable
GO version:   go1.13.8
Built:        2020-04-22T21:29:52+0000
OS/Arch:      linux/amd64
```
Helm chart 0.16.0.
We have been using this patch by Chet Lemon (!1664 (closed)) on top of 11.10 for some months now with success (as seen in #4119 (comment 243809158)); at least from our point of view, the use of Kubernetes Jobs seems to solve this issue. We also wonder whether marking nodes that are currently running jobs as not safe for scale-down, and removing that mark when the build completes, would be a viable solution.
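As a rough illustration of that idea, the cluster autoscaler can be told not to scale down a node while a job pod is running on it by annotating the job pods; a sketch of how that could look in config.toml is below. Note this only influences autoscaler scale-down, not GCE preempting the VM itself, and the annotation value is our assumption of how we would try it rather than something from our current setup.

```toml
[runners.kubernetes]
  # ... existing kubernetes executor settings ...
  [runners.kubernetes.pod_annotations]
    # Ask the cluster autoscaler not to evict this pod (and therefore not to
    # scale down its node) while the job is running. Does not prevent the
    # preemptible VM itself from being preempted by GCE.
    "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
```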
Thanks in advance.