Kubernetes Runner Container pods taking too long to start in EKS: waiting for pod running: timed out waiting for pod to start.

Summary

The runners are deployed in Azure AKS, AWS EKS, and GCP GKE. The runners in AKS and GKE mostly function as intended, although we see some errors there as well. The runners in EKS, however, do not.

Internal Zendesk ticket for reference.

Actual behavior

Pods take too long to spawn and start, so the job times out while waiting for them.

Expected behavior

Runner pods should spawn and start quickly enough that jobs do not time out.

Relevant logs and/or screenshots

ERROR: Job failed (system failure): pods \"\" not found

Waiting for pod sast-runner/runner- to be running, status is Pending
  ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
  ContainersNotReady: "containers with unready status: [build helper]"
  ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod sast-runner/runner- to be running, status is Pending
  ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
  ContainersNotReady: "containers with unready status: [build helper]"
  ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod sast-runner/runner- to be running, status is Pending
  ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
  ContainersNotReady: "containers with unready status: [build helper]"
  ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Environment description

  • config.toml file:
listen_address = ":9252"
concurrent = 50
check_interval = 3
log_level = "info"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = 
  output_limit = 40000
  url = 
  id = 
  token = 
  token_obtained_at = 0000-00-00T00:00:00Z
  token_expires_at = 0000-00-00T00:00:00Z
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.cache]
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = ""
    namespace = ""
    namespace_overwrite_allowed = ""
    privileged = true
    cpu_request = "500m"
    cpu_request_overwrite_max_allowed = "7000m"
    memory_limit = "5Gi"
    memory_limit_overwrite_max_allowed = "8Gi"
    memory_request = "1Gi"
    memory_request_overwrite_max_allowed = "8Gi"
    service_cpu_request = "500m"
    service_cpu_request_overwrite_max_allowed = "2"
    service_memory_limit = "2300Mi"
    service_memory_limit_overwrite_max_allowed = "3Gi"
    service_memory_request = "500Mi"
    service_memory_request_overwrite_max_allowed = "3Gi"
    helper_cpu_request = "800m"
    helper_cpu_request_overwrite_max_allowed = "1.5"
    helper_memory_limit = "2300Mi"
    helper_memory_limit_overwrite_max_allowed = "3Gi"
    helper_memory_request = "500Mi"
    helper_memory_request_overwrite_max_allowed = "3Gi"
    pull_policy = ["if-not-present"]
    node_selector_overwrite_allowed = ""
    helper_image = ""
    pod_labels_overwrite_allowed = ""
    service_account = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.affinity]
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.init_permissions_container_security_context]
      [runners.kubernetes.init_permissions_container_security_context.capabilities]
    [runners.kubernetes.build_container_security_context]
      [runners.kubernetes.build_container_security_context.capabilities]
    [runners.kubernetes.helper_container_security_context]
      [runners.kubernetes.helper_container_security_context.capabilities]
    [runners.kubernetes.service_container_security_context]
      [runners.kubernetes.service_container_security_context.capabilities]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
    [runners.kubernetes.container_lifecycle]
  • Pod logs:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

Configuration (with the authentication token) was saved in "/home/gitlab-runner/.gitlab-runner/config.toml"
Runtime platform                                    arch=amd64 os=linux pid=7 revision=3046fee8 version=16.6.0
Starting multi-runner from /home/gitlab-runner/.gitlab-runner/config.toml...  builds=0 max_builds=0
WARNING: Running in user-mode.
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...

There might be a problem with your config based on jsonschema annotations in common/config.go (experimental feature):
jsonschema: '/runners/0/kubernetes/affinity/pod_anti_affinity/required_during_scheduling_ignored_during_execution' does not validate with https://gitlab.com/gitlab-org/gitlab-runner/common/config#/$ref/properties/runners/items/$ref/properties/kubernetes/$ref/properties/affinity/$ref/properties/pod_anti_affinity/$ref/properties/required_during_scheduling_ignored_during_execution/type: expected array, but got null
  • Job logs:
Running with gitlab-runner 16.6.0 (3046fee8)
  on , system ID: r_ziC32FAbUHVg
  feature flags: FF_PRINT_POD_EVENTS:true
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: lightspeed-runner
Using Kubernetes executor with image 
Using attach strategy to execute scripts...
Preparing environment
03:04
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 4h0m0s...
Subscribing to Kubernetes Pod events...
Type     Reason      Message
Normal   Scheduled   Successfully assigned lightspeed-runner
Normal   Pulling   Pulling image "gitlab-runner/gitlab-runner-helper:x86_64-v16.5.0"
Normal   Pulled   Successfully pulled image "gitlab-runner-helper:x86_64-v16.5.0" in 36.483990843s (36.484013983s including waiting)
Normal   Created   Created container init-permissions
Normal   Started   Started container init-permissions
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Used GitLab Runner version

GitLab Runner 16.6.0

Possible fixes

Increase poll_timeout under [runners.kubernetes] in the runner configuration so the runner waits longer for the pod to reach the Running state.
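
A minimal sketch of that change, assuming the posted config.toml relies on the documented defaults (poll_timeout = 180 seconds, poll_interval = 3 seconds), which would be consistent with the 03:04 "Preparing environment" duration in the job log. The value 600 is only an illustrative assumption, not a tested recommendation; it should be sized to cover the slowest observed image pull and init-permissions run on the EKS nodes.

```toml
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Seconds the runner waits for the job pod to start before failing
    # with "timed out waiting for pod to start". 600 is an example value.
    poll_timeout = 600
    # Seconds between pod status checks (3 is the documented default).
    poll_interval = 3
```

This only gives slow pods more time to start. It does not address why startup on EKS is slow in the first place (for example the 36-second helper image pull shown in the pod events).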