Preparing environment hungs - double time slowness in jobs

UPDATE! This seems to be runner isue as the timing problem doeasn't exist when reverted back to runner gitlab-runner:alpine-v14.1.0??

Summary

After gitlab upgrade helm chart to gitlab-5.4.5 (14.4.5), we observe double timing in job runs and especially in Preparing environment stage, in our kubernetes cluster Preparing environment stage sometimes takes as long as 4min and not always so hard to trace what is wrong.

Steps to reproduce

Upgrade to 5.4.5 with runner gitlab-runner:alpine-v14.4.2

Configuration used

image: docker-proxy.ourdomain.io/gitlab/gitlab-runner:alpine-v14.4.2
#image: gitlab/gitlab-runner:alpine-v13.4.1
gitlabUrl: https://gitlab.ourdomain.io/
runnerRegistrationToken: "redacted"
unregisterRunners: true
concurrent: 50

logLevel: debug

envVars:
  - name: REGISTER_RUN_UNTAGGED
    value: "true"

rbac:
  create: true
  imagePullSecrets: ["regsecret"]

## Configuration for the Pods that that the runner launches for each new job
##
runners:
  config: |
    [[runners]]
      environment = ["REGISTER_RUN_UNTAGGED=true", "RUNNER_OUTPUT_LIMIT=40960", "KUBERNETES_POLL_TIMEOUT=480"]
      [runners.kubernetes]
        image = "docker-proxy.ourdomain.io/ubuntu:18.04"
        privileged = true
        image_pull_secrets = ["regsecret"]
        poll_timeout = 600
        cpu_limit = "2"
        memory_limit = "7.5Gi"
        helper_image = "docker-proxy.ourdomain.io/gitlab/gitlab-runner-helper:x86_64-v${CI_RUNNER_VERSION}"
        [runners.kubernetes.node_selector]
          "kops.k8s.io/instancegroup" = "spot"
        [runners.kubernetes.pod_labels]
          "podLabels" = "spot"
          "team" = "shared"
          "environment" = "devops"
          "environment2" = "internal-tools"
          "component_type" = "runner"
          "service" = "gitlab-ci"
        [runners.cache]
          Type = "s3"
          Path = "gitlab_runner"
          Shared = true
          [runners.cache.s3]
            ServerAddress = "s3.amazonaws.com"
            BucketName = "ourdomain-gitlab-cache"
            BucketLocation = "eu-west-1"
            Insecure = false

  locked: false
  tags: "k8s"
#
nodeSelector: {
  kops.k8s.io/instancegroup: ondemand
}

Current behavior

Preparing environment hungs with error in runner log, jobs takes twice as long as before

gitlab-runner-test-gitlab-runner WARNING: Submitting job to coordinator... job failed  code=403 job=1759493 job-status=canceled runner=xHvg9A1q update-interval=0s 

Expected behavior

Preparing env should take much quicker

Versions

  • Chart: 5.4.5 (14.4.5)
  • Platform:
    • Cloud: (AWS)
    • Self-hosted: (Kops)
  • Kubernetes: (kubectl version)
    • Client: 1.20.12
    • Server: 1.20.12
  • Helm: (helm version)
    • Client: 3.6.3
    • Server: n/a

Relevant logs

runner debug log when hungs

Checking for jobs... received                       job=1759955 repo_url=https://gitlab.ourdomain.io/ahoy/zoovu-platform-backend.git runner=BdTZKDHs
Running with gitlab-runner 14.4.2 (50fc80a6)        job=1759955 project=12 runner=BdTZKDHs
  on gitlab-runner-test-gitlab-runner-57d464cb4b-ppdjm BdTZKDHs  job=1759955 project=12 runner=BdTZKDHs
Preparing the "kubernetes" executor     job=1759955 project=12 runner=BdTZKDHs
Regex allowing overrides for Namespace is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
Regex allowing overrides for ServiceAccount is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
Regex allowing overrides for BearerToken is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
Regex allowing overrides for PodAnnotations is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for CPURequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for MemoryRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for EphemeralStorageRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for CPULimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for MemoryLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for EphemeralStorageLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceCPURequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceMemoryRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceEphemeralStorageRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceCPULimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceMemoryLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for ServiceEphemeralStorageLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperCPURequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperMemoryRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperEphemeralStorageRequest is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperCPULimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperMemoryLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
setting allowing overrides for HelperEphemeralStorageLimit is empty, disabling override.  job=1759955 project=12 runner=BdTZKDHs
Using Kubernetes namespace: gitlab-ci               job=1759955 project=12 runner=BdTZKDHs
Using Kubernetes executor with image docker.zoovu.io/st3-runner-image ...  job=1759955 project=12 runner=BdTZKDHs
Using attach strategy to execute scripts...         job=1759955 project=12 runner=BdTZKDHs
Using helper image: registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-50fc80a6  job=1759955 project=12 runner=BdTZKDHs
  job=1759955 project=12 runner=BdTZKDHs
Waiting for signals...                              job=1759955 project=12 runner=BdTZKDHs
No referees configured                              job=1759955 project=12 runner=BdTZKDHs
Executing build stage                               build_stage=prepare_script job=1759955 project=12 runner=BdTZKDHs
Preparing environment                   job=1759955 project=12 runner=BdTZKDHs
Starting Kubernetes command with attach...          job=1759955 project=12 runner=BdTZKDHs
Setting up secrets                                  job=1759955 project=12 runner=BdTZKDHs
Setting up scripts config map                       job=1759955 project=12 runner=BdTZKDHs
Appending trace to coordinator... ok                code=202 job=1759955 job-log=0-544 job-status=running runner=BdTZKDHs sent-log=0-543 status=202 Accepted update-interval=1m0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Submitting job to coordinator... ok                 code=200 job=1759955 job-status= runner=BdTZKDHs update-interval=0s
Setting up build pod                                job=1759955 project=12 runner=BdTZKDHs
Creating build pod                                  job=1759955 project=12 runner=BdTZKDHs
Creating pod proxy services                         job=1759955 project=12 runner=BdTZKDHs
Starting in container "helper" the command ["gitlab-runner-build" "<<<" "/scripts-12-1759955/prepare_script" "2>&1 | tee -a /logs-12-1759955/output.log"] with script: #!/usr/bin/env bash
  job=1759955 project=12 runner=BdTZKDHs
Remote process exited with the status: CommandExitCode: 0, Script: /scripts-12-1759955/prepare_script  job=1759955 project=12 runner=BdTZKDHs
Container "helper" exited with error: <nil>         job=1759955 project=12 runner=BdTZKDHs
Executing build stage                               build_stage=get_sources job=1759955 project=12 runner=BdTZKDHs
Getting source from Git repository      job=1759955 project=12 runner=BdTZKDHs
Starting Kubernetes command with attach...          job=1759955 project=12 runner=BdTZKDHs

Not sure if related to the issue:

occasional (twice per minuet or so )

 gitaly *** /var/log/gitaly/gitaly.log ***                                                                                                                                             │
│ gitaly {"level":"warning","msg":" transport: http2Server.HandleStreams failed to read frame: read unix /home/git/internal_11.sock-\u003e@: read: connection reset by peer","pid":11," │
│ gitaly {"level":"warning","msg":" transport: http2Server.HandleStreams failed to read frame: read unix /home/git/internal_11.sock-\u003e@: read: connection reset by peer","pid":11," │
│ gitaly {"level":"warning","msg":" transport: http2Server.HandleStreams failed to read frame: read unix /home/git/internal_11.sock-\u003e@: read: connection reset by peer","pid":11," │
│ gitaly {"level":"warning","msg":" transport: http2Server.HandleStreams failed to read frame: read tcp 100.102.113.106:8075-\u003e100.118.174.28:37760: read: connection reset by peer │
│ gitaly                        

Also observed intermittent secret and configmaps difficulties for runner with build to mount during startup

from runner events

4m32s       Warning   FailedMount              pod/runner-u6p2qqau-project-97-concurrent-10jlbnz           MountVolume.SetUp failed for volume "scripts" : failed to sync configmap cache: timed out waiting for the condition


2m29s       Warning   FailedMount              pod/runner-u6p2qqau-project-97-concurrent-1xjktd            MountVolume.SetUp failed for volume "default-token-2nw7s" : failed to sync secret cache: timed out waiting for the condition

interesting kube-api logs also intermittent

I0128 08:57:43.260860       1 node_authorizer.go:206] "NODE DENY" err="node 'ip-10-0-1-109.eu-west-1.compute.internal' cannot get unknown configmap gitlab-ci/runner-bt4oge9--project-12-concurrent-19-scripts4mtr8"
I0128 09:06:56.270020       1 node_authorizer.go:206] "NODE DENY" err="node 'ip-10-0-1-247.eu-west-1.compute.internal' cannot get unknown configmap gitlab-ci/runner-bt4oge9--project-74-concurrent-3-scripts6jbjn"
I0128 09:06:56.271295       1 node_authorizer.go:206] "NODE DENY" err="node 'ip-10-0-1-247.eu-west-1.compute.internal' cannot get secret gitlab-ci/regsecret, no relationship to this object was found in the node authorizer graph"
I0128 09:06:56.272043       1 node_authorizer.go:206] "NODE DENY" err="node 'ip-10-0-1-247.eu-west-1.compute.internal' cannot get secret gitlab-ci/default-token-2nw7s, no relationship to this object was found in the node authorizer graph"

More to follow if we find more info.This makes our life really hard Please help or advice.

Edited by chris dz