Kubernetes Runner Container pods taking too long to start in EKS: waiting for pod running: timed out waiting for pod to start.
Summary
The runners are deployed in Azure AKS, AWS EKS, and GCP GKE. The runners in AKS and GKE appear to be mostly functioning as intended, though we see occasional errors there as well. The runners in EKS, however, are not functioning correctly.
Internal Zendesk ticket for reference.
Actual behavior
Pods take too long to spawn and start, causing the jobs to time out.
Expected behavior
Runner pods should spawn and start before the pod-start timeout is reached.
Relevant logs and/or screenshots
ERROR: Job failed (system failure): pods "" not found
Waiting for pod sast-runner/runner- to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod sast-runner/runner- to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod sast-runner/runner- to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
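When a pod sits in Pending with ContainersNotInitialized/ContainersNotReady as above, the cluster events usually name the underlying cause (node autoscaling delay, slow image pulls, failed volume mounts). A few kubectl commands for inspecting this; the namespace matches the log output above, while the pod name is an illustrative placeholder, and cluster access is assumed:

```shell
# List runner pods and their current phase in the runner namespace
kubectl get pods -n sast-runner -o wide

# Show scheduling and container events for one stuck pod
# (replace runner-xxxxx with the actual pod name from the job log)
kubectl describe pod runner-xxxxx -n sast-runner

# Recent events in the namespace, oldest first, to spot
# image-pull delays, failed mounts, or autoscaler activity
kubectl get events -n sast-runner --sort-by=.lastTimestamp
```

Comparing event timestamps against the ~3-minute "Preparing environment" stage shows whether the time is lost in scheduling, image pulling, or container initialization.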
Environment description
- Config.toml file
listen_address = ":9252"
concurrent = 50
check_interval = 3
log_level = "info"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name =
  output_limit = 40000
  url =
  id =
  token =
  token_obtained_at = 0000-00-00T00:00:00Z
  token_expires_at = 0000-00-00T00:00:00Z
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.cache]
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = ""
    namespace = ""
    namespace_overwrite_allowed = ""
    privileged = true
    cpu_request = "500m"
    cpu_request_overwrite_max_allowed = "7000m"
    memory_limit = "5Gi"
    memory_limit_overwrite_max_allowed = "8Gi"
    memory_request = "1Gi"
    memory_request_overwrite_max_allowed = "8Gi"
    service_cpu_request = "500m"
    service_cpu_request_overwrite_max_allowed = "2"
    service_memory_limit = "2300Mi"
    service_memory_limit_overwrite_max_allowed = "3Gi"
    service_memory_request = "500Mi"
    service_memory_request_overwrite_max_allowed = "3Gi"
    helper_cpu_request = "800m"
    helper_cpu_request_overwrite_max_allowed = "1.5"
    helper_memory_limit = "2300Mi"
    helper_memory_limit_overwrite_max_allowed = "3Gi"
    helper_memory_request = "500Mi"
    helper_memory_request_overwrite_max_allowed = "3Gi"
    pull_policy = ["if-not-present"]
    node_selector_overwrite_allowed = ""
    helper_image = ""
    pod_labels_overwrite_allowed = ""
    service_account = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.affinity]
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.init_permissions_container_security_context]
      [runners.kubernetes.init_permissions_container_security_context.capabilities]
    [runners.kubernetes.build_container_security_context]
      [runners.kubernetes.build_container_security_context.capabilities]
    [runners.kubernetes.helper_container_security_context]
      [runners.kubernetes.helper_container_security_context.capabilities]
    [runners.kubernetes.service_container_security_context]
      [runners.kubernetes.service_container_security_context.capabilities]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
    [runners.kubernetes.container_lifecycle]
- Pod logs:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!
Configuration (with the authentication token) was saved in "/home/gitlab-runner/.gitlab-runner/config.toml"
Runtime platform arch=amd64 os=linux pid=7 revision=3046fee8 version=16.6.0
Starting multi-runner from /home/gitlab-runner/.gitlab-runner/config.toml... builds=0 max_builds=0
WARNING: Running in user-mode.
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
There might be a problem with your config based on jsonschema annotations in common/config.go (experimental feature):
jsonschema: '/runners/0/kubernetes/affinity/pod_anti_affinity/required_during_scheduling_ignored_during_execution' does not validate with https://gitlab.com/gitlab-org/gitlab-runner/common/config#/$ref/properties/runners/items/$ref/properties/kubernetes/$ref/properties/affinity/$ref/properties/pod_anti_affinity/$ref/properties/required_during_scheduling_ignored_during_execution/type: expected array, but got null
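The jsonschema warning above indicates that pod_anti_affinity was declared somewhere in the config with required_during_scheduling_ignored_during_execution left null instead of an array. One way to make the block well-formed is a sketch like the following; the topology key and label values are illustrative assumptions, not taken from this environment:

```toml
[runners.kubernetes.affinity]
  [runners.kubernetes.affinity.pod_anti_affinity]
    [[runners.kubernetes.affinity.pod_anti_affinity.required_during_scheduling_ignored_during_execution]]
      topology_key = "kubernetes.io/hostname"
      [runners.kubernetes.affinity.pod_anti_affinity.required_during_scheduling_ignored_during_execution.label_selector]
        [[runners.kubernetes.affinity.pod_anti_affinity.required_during_scheduling_ignored_during_execution.label_selector.match_expressions]]
          key = "app"
          operator = "In"
          values = ["gitlab-runner"]
```

Alternatively, removing the empty pod_anti_affinity table entirely should silence the warning; the warning is experimental and may be unrelated to the pod-start timeout itself.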
- Job logs:
Running with gitlab-runner 16.6.0 (3046fee8)
on , system ID: r_ziC32FAbUHVg
feature flags: FF_PRINT_POD_EVENTS:true
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: lightspeed-runner
Using Kubernetes executor with image
Using attach strategy to execute scripts...
Preparing environment
03:04
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 4h0m0s...
Subscribing to Kubernetes Pod events...
Type Reason Message
Normal Scheduled Successfully assigned lightspeed-runner
Normal Pulling Pulling image "gitlab-runner/gitlab-runner-helper:x86_64-v16.5.0"
Normal Pulled Successfully pulled image "gitlab-runner-helper:x86_64-v16.5.0" in 36.483990843s (36.484013983s including waiting)
Normal Created Created container init-permissions
Normal Started Started container init-permissions
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Used GitLab Runner version
GitLab Runner 16.6.0
Possible fixes
Increase poll_timeout under [runners.kubernetes] in the runner configuration. Its default of 180 seconds matches the roughly three-minute "Preparing environment" stage in the job log before the timeout fires.
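A minimal sketch of that change (the values are illustrative; poll_timeout defaults to 180 seconds and poll_interval to 3):

```toml
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Wait up to 10 minutes for the pod to reach Running
    # before failing the job (default: 180 seconds)
    poll_timeout = 600
    # Check the pod's status every 5 seconds (default: 3)
    poll_interval = 5
```

Raising poll_timeout masks the slow start rather than fixing it, so it is worth pairing with the kubectl event inspection to find where the startup time actually goes.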