Kubernetes runner: ERROR: Job failed (system failure): prepare environment: waiting for pod running
I am opening this issue to report inconsistent behavior with the Kubernetes executor.
We have a Kubernetes (`v1.31.1`) node pool dedicated to our GitLab runners, configured as follows:
- autoscaling: true
- desired nodes: 0
- min nodes: 0
- max nodes: 3
We use tolerations to schedule the runner pods on this pool. The problem is that scaling up a node takes some time, and in the meantime we get the following error:
```console
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
WARNING: Event retrieved from the cluster: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
ERROR: Job failed (system failure): prepare environment: waiting for pod running: Get "https://10.3.0.1:443/api/v1/namespaces/gitlab/pods/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd": dial tcp 10.3.0.1:443: connect: connection refused. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
```
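Note that the final error in the log above is a refused connection to the API server (`10.3.0.1:443`), not a scheduling timeout. As a rough illustration, here is a minimal Python sketch that separates the two kinds of lines. The function name `classify_line` and the labels it returns are hypothetical and chosen only for this example; they are not GitLab Runner terms.

```python
def classify_line(line: str) -> str:
    """Label a runner log line. Labels are illustrative, not GitLab Runner terms."""
    if "connection refused" in line:
        # The runner could not reach the Kubernetes API server at all.
        return "api-unreachable"
    if "Unschedulable" in line or "status is Pending" in line:
        # The pod exists but no matching node is available yet.
        return "waiting-for-node"
    return "other"


# Two lines taken from the log above (the second one is abridged).
log = [
    "Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd "
    "to be running, status is Pending",
    "ERROR: Job failed (system failure): prepare environment: waiting for pod "
    "running: dial tcp 10.3.0.1:443: connect: connection refused",
]

labels = [classify_line(line) for line in log]
print(labels)
```

With these two sample lines, the first is labeled as a wait for a node and the second as an unreachable API server, which suggests the job did not fail because the poll timeout was exhausted.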
Within the **same pipeline** and with the **same runner**, some jobs wait correctly for several minutes while others fail after only ~60 seconds.
Our runner configuration (`gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0`) looks like this:
```toml
concurrent = 24
check_interval = 3
log_level = "info"
connection_max_age = "15m0s"
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "gitlab-runner-76ddf4c6df-bnnl9"
url = "[redacted]"
id = 9
token = "[redacted]"
token_obtained_at = 2024-12-17T13:44:33Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "kubernetes"
[runners.custom_build_dir]
[runners.cache]
Type = "s3"
Shared = true
MaxUploadedArchiveSize = 0
[runners.cache.s3]
ServerAddress = "[redacted]"
AccessKey = "[redacted]"
SecretKey = "[redacted]"
BucketName = "[redacted]"
BucketLocation = "gra"
[runners.cache.gcs]
[runners.cache.azure]
[runners.feature_flags]
FF_RETRIEVE_POD_WARNING_EVENTS = true
FF_USE_FASTZIP = true
FF_WAIT_FOR_POD_TO_BE_REACHABLE = true
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = ""
namespace = "gitlab"
namespace_overwrite_allowed = ""
namespace_per_job = false
allow_privilege_escalation = false
memory_limit = "1G"
service_memory_limit = "1G"
allowed_images = ["[redacted]/*:*", "[redacted]/*/*:*"]
allowed_pull_policies = ["always", "if-not-present"]
allowed_services = ["[redacted]/*:*", "[redacted]/*/*:*"]
pull_policy = ["always", "if-not-present"]
node_selector_overwrite_allowed = ""
node_tolerations_overwrite_allowed = ""
image_pull_secrets = ["registry-creds"]
helper_image = "gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0"
poll_interval = 5
poll_timeout = 1000
retry_limit = 90
pod_labels_overwrite_allowed = ""
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
[runners.kubernetes.node_selector]
"[redacted]/role" = "ci"
[runners.kubernetes.node_tolerations]
"[redacted]/role=ci" = "NoSchedule"
[runners.kubernetes.init_permissions_container_security_context]
[runners.kubernetes.init_permissions_container_security_context.capabilities]
[runners.kubernetes.build_container_security_context]
[runners.kubernetes.build_container_security_context.capabilities]
[runners.kubernetes.helper_container_security_context]
[runners.kubernetes.helper_container_security_context.capabilities]
[runners.kubernetes.service_container_security_context]
[runners.kubernetes.service_container_security_context.capabilities]
[runners.kubernetes.volumes]
[runners.kubernetes.dns_config]
```
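For reference, here is a rough calculation of the wait budget I would expect from the settings above. It assumes that `poll_timeout` is a total duration in seconds and that `poll_interval` is the delay in seconds between pod status checks, which is my reading of the Kubernetes executor documentation. The variable `observed_failure` is only the approximate figure reported above.

```python
# Values copied from the [runners.kubernetes] section above.
poll_interval = 5      # assumed: seconds between pod status checks
poll_timeout = 1000    # assumed: seconds before the runner gives up waiting

# Approximate time after which some jobs failed (from the report above).
observed_failure = 60

# Number of status checks that fit into the timeout.
expected_polls = poll_timeout // poll_interval

# Share of the configured wait budget that had elapsed at failure time.
used_fraction = observed_failure / poll_timeout

print(f"expected wait budget: {poll_timeout}s (~{expected_polls} status checks)")
print(f"observed failure after ~{observed_failure}s ({used_fraction:.0%} of the budget)")
```

Under these assumptions, the failing jobs stop long before the configured timeout, which is why I consider the behavior inconsistent.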