Kubernetes runner: ERROR: Job failed (system failure): prepare environment: waiting for pod running
I am opening this issue to report inconsistent behavior with the Kubernetes executor.

We have a Kubernetes (`v1.31.1`) node pool dedicated to our GitLab runners, configured as follows:

- autoscaling: true
- desired nodes: 0
- min nodes: 0
- max nodes: 3

We use tolerations to schedule the runner's pods on this pool. The problem is that scaling up a node takes some time, and we get the following error:

```console
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
WARNING: Event retrieved from the cluster: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
ERROR: Job failed (system failure): prepare environment: waiting for pod running: Get "https://10.3.0.1:443/api/v1/namespaces/gitlab/pods/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd": dial tcp 10.3.0.1:443: connect: connection refused. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
```

In the **same pipeline** with the **same runner**, some jobs wait correctly for several minutes while others fail after only ~60 seconds.

Our runner configuration (`gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0`) looks like this:

```toml
concurrent = 24
check_interval = 3
log_level = "info"
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-runner-76ddf4c6df-bnnl9"
  url = "[redacted]"
  id = 9
  token = "[redacted]"
  token_obtained_at = 2024-12-17T13:44:33Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "[redacted]"
      AccessKey = "[redacted]"
      SecretKey = "[redacted]"
      BucketName = "[redacted]"
      BucketLocation = "gra"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.feature_flags]
    FF_RETRIEVE_POD_WARNING_EVENTS = true
    FF_USE_FASTZIP = true
    FF_WAIT_FOR_POD_TO_BE_REACHABLE = true
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = ""
    namespace = "gitlab"
    namespace_overwrite_allowed = ""
    namespace_per_job = false
    allow_privilege_escalation = false
    memory_limit = "1G"
    service_memory_limit = "1G"
    allowed_images = ["[redacted]/*:*", "[redacted]/*/*:*"]
    allowed_pull_policies = ["always", "if-not-present"]
    allowed_services = ["[redacted]/*:*", "[redacted]/*/*:*"]
    pull_policy = ["always", "if-not-present"]
    node_selector_overwrite_allowed = ""
    node_tolerations_overwrite_allowed = ""
    image_pull_secrets = ["registry-creds"]
    helper_image = "gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0"
    poll_interval = 5
    poll_timeout = 1000
    retry_limit = 90
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
      "[redacted]/role" = "ci"
    [runners.kubernetes.node_tolerations]
      "[redacted]/role=ci" = "NoSchedule"
    [runners.kubernetes.init_permissions_container_security_context]
      [runners.kubernetes.init_permissions_container_security_context.capabilities]
    [runners.kubernetes.build_container_security_context]
      [runners.kubernetes.build_container_security_context.capabilities]
    [runners.kubernetes.helper_container_security_context]
      [runners.kubernetes.helper_container_security_context.capabilities]
    [runners.kubernetes.service_container_security_context]
      [runners.kubernetes.service_container_security_context.capabilities]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
```
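
To make the expectation concrete, here is a rough sketch of the wait budget these settings should give. It assumes that `poll_timeout` is the total number of seconds the runner waits for the pod and that `poll_interval` is the delay between status checks. That reading of the two settings and the helper name `expected_wait_budget` are assumptions for illustration, not taken from the runner source.

```python
import math

# Values copied from the [runners.kubernetes] section of the configuration.
POLL_INTERVAL_S = 5
POLL_TIMEOUT_S = 1000


def expected_wait_budget(poll_interval_s: int, poll_timeout_s: int) -> tuple[int, float]:
    """Return (number of status checks, total minutes) before a timeout.

    Assumption: the runner checks the pod status once per poll_interval
    and gives up once poll_timeout seconds have elapsed.
    """
    checks = math.ceil(poll_timeout_s / poll_interval_s)
    minutes = poll_timeout_s / 60
    return checks, minutes


checks, minutes = expected_wait_budget(POLL_INTERVAL_S, POLL_TIMEOUT_S)
print(f"expected: up to {checks} checks over ~{minutes:.1f} min")
```

Under that assumption, every job should tolerate roughly 16 minutes of node scale-up, so a failure after ~60 seconds does not look like a poll timeout. The failing job's error also ends with `connection refused` from the API server rather than a timeout message.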