Increasing timeout for the Kubernetes API calls

Overview

Hello dear GitLab community!

My company uses managed GitLab runners, installed with Helm chart v0.49.2 and an almost default configuration.

Recently the number of failed jobs has increased significantly: we now see system failures in roughly 20% of ALL runs. Most of the errors occur when a job starts or finishes executing, for example:

1. Unexpected EOF errors

Examples:

Preparing environment
Waiting for pod gitlab/runner-msxcmh-b-project-43818217-concurrent-12jhm6j to be running, status is Pending
Waiting for pod gitlab/runner-msxcmh-b-project-43818217-concurrent-12jhm6j to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-msxcmh-b-project-43818217-concurrent-12jhm6j/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Running on runner-msxcmh-b-project-43818217-concurrent-8887zv via battle-gitlab-runner-7ff459d977-7vcqd...
Getting source from Git repository
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
01:46
Cleaning up project directory and file based variables
00:01
ERROR: Job failed (system failure): error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-msxcmh-b-project-43818217-concurrent-8887zv/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF

and

Preparing environment
Waiting for pod gitlab/runner-m4caurva-project-43818217-concurrent-265662k to be running, status is Pending
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-m4caurva-project-43818217-concurrent-265662k/attach?container=helper&stdin=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
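
For readability: the long `command=` parameter in the `exec` URLs above is just a URL-encoded shell-detection script. Below is a minimal Python sketch for decoding it with the standard library. The URL is copied from the first log excerpt; the helper name `decode_exec_request` is our own and not part of any GitLab tooling.

```python
from urllib.parse import parse_qs, urlsplit

# Exec URL copied from the first failing job log above.
EXEC_URL = (
    "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/"
    "runner-msxcmh-b-project-43818217-concurrent-12jhm6j/exec"
    "?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then"
    "%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+"
    "%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+"
    "%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+"
    "%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+"
    "%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+"
    "%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+"
    "%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+"
    "%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A"
    "&container=build&container=build&stderr=true&stdin=true&stdout=true"
)


def decode_exec_request(url):
    """Split a pod exec URL into its path, command arguments and containers."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # decodes '+' and %XX escapes
    return {
        "path": parts.path,
        "command": params.get("command", []),
        "container": params.get("container", []),
    }


decoded = decode_exec_request(EXEC_URL)
print(decoded["path"])
print(decoded["command"][:2])   # the first two arguments: sh and -c
print(decoded["command"][2])    # the decoded shell-detection script
```

So the request that fails with `unexpected EOF` is only the runner opening a shell in the `build` container, not anything from our job script.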

2. Stuck while cleaning up environment variables or directories

Examples:

Running on runner-m4caurva-project-43818217-concurrent-21skfzz via battle-gitlab-runner-5bb6cc4779-jdnxp...
Getting source from Git repository
00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
01:46
Cleaning up project directory and file based variables
58:06
ERROR: Job failed: execution took longer than 1h0m0s seconds

Running on runner-2gzajnre-project-43818217-concurrent-98s7nl6 via battle-gitlab-runner-6895fb784c-dzwnr...
Getting source from Git repository
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Cleaning up project directory and file based variables

3. Stuck at uploading artifacts

Example:

Uploading artifacts for failed job
/scripts-33584057-3846133109/upload_artifacts_on_failure: line 430: cd: /builds/emergetech/core/core-bookings: No such file or directory

This is also a very common issue and a real pain in the neck for us. The path is definitely present when we simply re-run the pipeline, so this looks like one more random internal issue.

Our Helm chart values:

        gitlabUrl: "https://gitlab.com/"
        concurrent: 100
        runnerRegistrationToken: "xxx"
        runners:
          tags: "dev,back,kuber,linux"
          config: |
            [[runners]]
              name = "battle"
              limit = 200
              request_concurrency = 100
              executor = "kubernetes"
              [runners.kubernetes.pod_annotations]
                "pod-cleanup.gitlab.com/ttl" = "5h"
                "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
              [runners.kubernetes]
                [runners.kubernetes.pod_labels]
                  app = "gitlab-battle"
                [runners.kubernetes.node_selector]
                  agentpool = "gitlab"
                [runners.kubernetes.node_tolerations]
                  "dedicated" = "NoSchedule"
                  "kubernetes.azure.com/scalesetpriority" = "NoSchedule"
                [[runners.kubernetes.volumes.secret]]
                  name = "gitlab-docker"
                  mount_path = "/etc/.docker"
                  read_only = true
        rbac:
          create: true
          clusterWideAccess: true
        envVars:
          - name: KUBERNETES_NODE_TOLERATIONS
            value: dedicated=gitlab-heavy:NoSchedule
        podLabels:
          env: gitlab
          service: gitlab-battle
        secrets:
          - name: runner-battle
        tolerations:
          - key: "node.kubernetes.io/unreachable"
            operator: "Exists"
            effect: "NoExecute"
            tolerationSeconds: 30
          - key: "node.kubernetes.io/not-ready"
            operator: "Exists"
            effect: "NoExecute"
            tolerationSeconds: 30

As you can see, there is nothing special here; it is almost the default configuration.

After some troubleshooting we figured out that many of the failures are caused by the runner's internal calls to the Kubernetes API. Is there a parameter to increase the timeout for these API calls, or to add retries for them on the runner's side?
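
To make the request concrete, the behaviour we are hoping for is roughly the following. This is only a hypothetical Python sketch, not GitLab Runner code: the function `call_with_retries`, its parameters, and the choice of `ConnectionError`/`TimeoutError` as the "transient" errors are all assumptions made for illustration.

```python
import time


def call_with_retries(call, attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Run call(), retrying transient errors with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # Delay doubles each time: 0.5s, 1s, 2s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


def make_flaky(failures):
    """Return a stand-in API call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def flaky():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("unexpected EOF")
        return "ok"

    return flaky


# Record the delays instead of really sleeping, to show the backoff schedule.
delays = []
result = call_with_retries(make_flaky(2), sleep=delays.append)
print(result, delays)  # prints: ok [0.5, 1.0]
```

In other words, a short-lived `unexpected EOF` from the API server would be retried a few times before the job is marked as a system failure.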

I would really appreciate any advice on how to improve our configuration, since we are currently working on a proof of concept for using GitHub Actions (self-hosted) instead of GitLab CI/CD because of the latter's very unstable behavior.

Edited Apr 03, 2023 by Darren Eastman