Kubernetes runner: ERROR: Job failed (system failure): prepare environment: waiting for pod running

I am opening this issue to report inconsistent behavior with the Kubernetes executor.

We have a Kubernetes (v1.31.1) node pool dedicated to our GitLab runners, configured as follows:

  • autoscaling: true
  • desired nodes: 0
  • min nodes: 0
  • max nodes: 3

We use tolerations to schedule the runner pods onto this pool. The problem is that scaling up a node takes some time, and while the pod is waiting we get the following error:

Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
Waiting for pod gitlab/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd to be running, status is Pending
	Unschedulable: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."
WARNING: Event retrieved from the cluster: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
ERROR: Job failed (system failure): prepare environment: waiting for pod running: Get "https://10.3.0.1:443/api/v1/namespaces/gitlab/pods/runner-t1m4yqdx-project-5-concurrent-0-au24ilpd": dial tcp 10.3.0.1:443: connect: connection refused. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Within the same pipeline and with the same runner, some jobs wait correctly for several minutes, while others fail after only ~60 seconds.

Our runner configuration (gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0) looks like this:

concurrent = 24
check_interval = 3
log_level = "info"
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-runner-76ddf4c6df-bnnl9"
  url = "[redacted]"
  id = 9
  token = "[redacted]"
  token_obtained_at = 2024-12-17T13:44:33Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "[redacted]"
      AccessKey = "[redacted]"
      SecretKey = "[redacted]"
      BucketName = "[redacted]"
      BucketLocation = "gra"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.feature_flags]
    FF_RETRIEVE_POD_WARNING_EVENTS = true
    FF_USE_FASTZIP = true
    FF_WAIT_FOR_POD_TO_BE_REACHABLE = true
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = ""
    namespace = "gitlab"
    namespace_overwrite_allowed = ""
    namespace_per_job = false
    allow_privilege_escalation = false
    memory_limit = "1G"
    service_memory_limit = "1G"
    allowed_images = ["[redacted]/*:*", "[redacted]/*/*:*"]
    allowed_pull_policies = ["always", "if-not-present"]
    allowed_services = ["[redacted]/*:*", "[redacted]/*/*:*"]
    pull_policy = ["always", "if-not-present"]
    node_selector_overwrite_allowed = ""
    node_tolerations_overwrite_allowed = ""
    image_pull_secrets = ["registry-creds"]
    helper_image = "gitlab/gitlab-runner-helper:alpine3.19-x86_64-v17.6.0"
    poll_interval = 5
    poll_timeout = 1000
    retry_limit = 90
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
      "[redacted]/role" = "ci"
    [runners.kubernetes.node_tolerations]
      "[redacted]/role=ci" = "NoSchedule"
    [runners.kubernetes.init_permissions_container_security_context]
      [runners.kubernetes.init_permissions_container_security_context.capabilities]
    [runners.kubernetes.build_container_security_context]
      [runners.kubernetes.build_container_security_context.capabilities]
    [runners.kubernetes.helper_container_security_context]
      [runners.kubernetes.helper_container_security_context.capabilities]
    [runners.kubernetes.service_container_security_context]
      [runners.kubernetes.service_container_security_context.capabilities]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
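
To put numbers on the gap between the configured wait and the observed failures, here is a minimal sketch. It assumes poll_timeout and poll_interval are both expressed in seconds, as the runner documentation describes them. The function names are hypothetical and only for illustration. This is plain arithmetic on the values from the configuration above, not the runner's actual polling code.

```python
# Values copied from the [runners.kubernetes] section above.
POLL_INTERVAL_S = 5      # poll_interval
POLL_TIMEOUT_S = 1000    # poll_timeout
OBSERVED_FAILURE_S = 60  # approximate time after which some jobs failed


def max_poll_attempts(poll_timeout_s: int, poll_interval_s: int) -> int:
    """Upper bound on status polls that fit in the timeout (ceiling division)."""
    return -(-poll_timeout_s // poll_interval_s)


def fails_before_timeout(observed_s: int, poll_timeout_s: int) -> bool:
    """True when a job failed before the configured wait budget was used up."""
    return observed_s < poll_timeout_s


attempts = max_poll_attempts(POLL_TIMEOUT_S, POLL_INTERVAL_S)
print(f"configured wait budget: {POLL_TIMEOUT_S}s, at most {attempts} polls")
print(f"failed early: {fails_before_timeout(OBSERVED_FAILURE_S, POLL_TIMEOUT_S)}")
```

With these settings a pod should be allowed to stay Pending for roughly 16 minutes, so a failure after ~60 seconds is not explained by poll_timeout alone. The failing job's log ends with a "connection refused" error from the API server rather than a timeout message.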
Edited by Maxime Loliée