Increasing timeout for the Kubernetes API calls
Overview
Hello dear GitLab community!
My company uses managed GitLab runners, installed with Helm chart v0.49.2 and an almost default configuration.
Recently the number of failed jobs has increased significantly; we estimate that about 20% of ALL runs now end in a system failure. The errors mostly occur when the runner starts or finishes executing a job, for example:
1. Unexpected EOF errors
Examples:
Preparing environment
Waiting for pod gitlab/runner-msxcmh-b-project-43818217-concurrent-12jhm6j to be running, status is Pending
Waiting for pod gitlab/runner-msxcmh-b-project-43818217-concurrent-12jhm6j to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-msxcmh-b-project-43818217-concurrent-12jhm6j/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
or
Running on runner-msxcmh-b-project-43818217-concurrent-8887zv via battle-gitlab-runner-7ff459d977-7vcqd...
Getting source from Git repository
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
01:46
Cleaning up project directory and file based variables
00:01
ERROR: Job failed (system failure): error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-msxcmh-b-project-43818217-concurrent-8887zv/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF
and
Preparing environment
Waiting for pod gitlab/runner-m4caurva-project-43818217-concurrent-265662k to be running, status is Pending
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/gitlab/pods/runner-m4caurva-project-43818217-concurrent-265662k/attach?container=helper&stdin=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
2. Stuck cleaning up file-based variables or directories
Examples:
Running on runner-m4caurva-project-43818217-concurrent-21skfzz via battle-gitlab-runner-5bb6cc4779-jdnxp...
Getting source from Git repository
00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
01:46
Cleaning up project directory and file based variables
58:06
ERROR: Job failed: execution took longer than 1h0m0s seconds
or
Running on runner-2gzajnre-project-43818217-concurrent-98s7nl6 via battle-gitlab-runner-6895fb784c-dzwnr...
Getting source from Git repository
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/emergetech/devops/runner-test/.git/
Created fresh repository.
Checking out 0a0010aa as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Cleaning up project directory and file based variables
3. Stuck at uploading artifacts
Example:
Uploading artifacts for failed job
/scripts-33584057-3846133109/upload_artifacts_on_failure: line 430: cd: /builds/emergetech/core/core-bookings: No such file or directory
This is also a very common issue and a real pain in the neck. The path is definitely present when we simply re-run the pipeline, so it looks like one more random internal issue.
Our Helm chart values:
gitlabUrl: "https://gitlab.com/"
concurrent: 100
runnerRegistrationToken: "xxx"
runners:
  tags: "dev,back,kuber,linux"
  config: |
    [[runners]]
      name = "battle"
      limit = 200
      request_concurrency = 100
      executor = "kubernetes"
      [runners.kubernetes.pod_annotations]
        "pod-cleanup.gitlab.com/ttl" = "5h"
        "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
      [runners.kubernetes]
      [runners.kubernetes.pod_labels]
        app = "gitlab-battle"
      [runners.kubernetes.node_selector]
        agentpool = "gitlab"
      [runners.kubernetes.node_tolerations]
        "dedicated" = "NoSchedule"
        "kubernetes.azure.com/scalesetpriority" = "NoSchedule"
      [[runners.kubernetes.volumes.secret]]
        name = "gitlab-docker"
        mount_path = "/etc/.docker"
        read_only = true
rbac:
  create: true
  clusterWideAccess: true
envVars:
  - name: KUBERNETES_NODE_TOLERATIONS
    value: dedicated=gitlab-heavy:NoSchedule
podLabels:
  env: gitlab
  service: gitlab-battle
secrets:
  - name: runner-battle
tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
As you can see, there is nothing special here; it is almost the default configuration.
After some troubleshooting we figured out that many of the failures come from the runner's requests to the Kubernetes API. Is there a parameter to increase the timeout for these API calls, or to add retries for them on the runner's side?
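To make the request concrete, the behavior we have in mind looks roughly like the sketch below. This is not GitLab Runner code and does not use any runner setting: `call_with_retries`, `TransientAPIError`, and `flaky_exec` are hypothetical names made up for illustration, and the delays are arbitrary. It only shows the idea of retrying a transient failure (such as `unexpected EOF`) with exponential backoff instead of failing the job on the first error.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a transient failure such as 'unexpected EOF'."""


def call_with_retries(request, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call request() and retry on TransientAPIError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on each attempt and is capped at max_delay;
            # random jitter keeps many concurrent jobs from retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Usage: a fake API call that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_exec():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientAPIError("unexpected EOF")
    return "ok"


# The sleep function is replaced with a no-op so the demo finishes instantly.
result = call_with_retries(flaky_exec, sleep=lambda seconds: None)
print(result, attempts["count"])  # prints "ok 3"
```

With something like this on the runner side, a single dropped connection to the API server would cost a short delay rather than a whole job.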
I would really appreciate any advice on how to improve our configuration, since we are currently working on a proof of concept that uses self-hosted GitHub Actions instead of GitLab CI/CD because of the very unstable behavior of the latter.