Kubernetes executor pods hanging when OOMKilled with cgroup v2 enabled
Summary
When using cgroup v2 on Kubernetes (e.g. AWS EKS with Amazon Linux 2023 nodes), GitLab Runner does not detect that an executor pod has been OOMKilled and does not terminate it. Instead the job hangs until the job timeout is reached and is only terminated then, blocking cluster resources for the whole duration of the hang.
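For context, whether a node uses cgroup v2 can be checked directly on the node, and the kill is already visible in the pod status while the job keeps hanging. The namespace and pod name below are placeholders:

    # cgroup2fs means cgroup v2; tmpfs means cgroup v1
    stat -fc %T /sys/fs/cgroup

    # while the job hangs, the container status already reports the kill
    kubectl -n gitlab-runner get pod <executor-pod> \
      -o jsonpath='{.status.containerStatuses[*].state.terminated.reason}'
    # -> OOMKilled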
Steps to reproduce
- Deploy GitLab Runner to Kubernetes with cgroup v2 enabled (e.g. AWS EKS with Amazon Linux 2023)
- Run a job that consumes more memory than the configured memory limit (a minimal reproducer is sketched below).
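A minimal, hypothetical reproducer job, assuming the build image ships GNU coreutils: GNU tail buffers the endless stream of zeros in memory without bound, so the build container is OOMKilled once it hits the memory_limit of 8192Mi from the configuration below.

    oom-repro:
      script:
        # allocates memory without bound until the container hits its limit
        - tail /dev/zero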
Actual behavior
The executor pod is OOMKilled but is not cleaned up and terminated by GitLab Runner until the job timeout is reached.
Expected behavior
GitLab Runner cleans up and terminates the executor pod shortly after it is OOMKilled, and the job is marked as failed due to OOM.
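For illustration only (this is not GitLab Runner's actual code path): the information needed for this is already exposed by the Kubernetes API, so a check along the following lines could fail the job as soon as the kill is reported. A minimal sketch using client-go; the namespace and pod name are placeholders.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Placeholder namespace/pod; the runner knows both for every job.
        pod, err := client.CoreV1().Pods("gitlab-runner").Get(
            context.Background(), "executor-pod", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        // The kubelet normally records the kill in the container status
        // with the reason string "OOMKilled".
        for _, cs := range pod.Status.ContainerStatuses {
            if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
                fmt.Printf("container %q OOMKilled (exit code %d): fail the job\n",
                    cs.Name, t.ExitCode)
            }
        }
    }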
Relevant logs and/or screenshots
n/a
Environment description
Runner deployed to Kubernetes cluster running on AWS EKS 1.30 with Amazon Linux 2023 nodes using cgroup v2.
config.toml contents
[[runners]]
  clone_url = "xxxxxxxxx"
  output_limit = 16384
  [runners.kubernetes]
    image = "xxxxxxxxx"
    helper_image = "xxxxxxxxx"
    image_pull_secrets = ["xxxxxxxxx"]
    poll_timeout = 600
    cpu_request = "1000m"
    cpu_request_overwrite_max_allowed = "2000m"
    memory_request = "1024Mi"
    memory_request_overwrite_max_allowed = "8192Mi"
    memory_limit = "8192Mi"
    helper_cpu_request = "50m"
    helper_memory_request = "200Mi"
    helper_memory_limit = "2048Mi"
    service_cpu_request = "500m"
    service_cpu_request_overwrite_max_allowed = "2000m"
    service_memory_request = "512Mi"
    service_memory_request_overwrite_max_allowed = "8192Mi"
    service_memory_limit = "8192Mi"
    [runners.kubernetes.pod_labels]
      "workload" = "ci"
    [runners.kubernetes.pod_annotations]
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    [runners.kubernetes.node_selector]
      "workload" = "ci"
      "arch" = "amd64"
    [runners.kubernetes.node_tolerations]
      "workload=ci" = "NoSchedule"
    [runners.kubernetes.affinity]
      [runners.kubernetes.affinity.pod_affinity]
        [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution]]
          weight = 100
          [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term]
            topology_key = "kubernetes.io/hostname"
            [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector]
              [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector.match_expressions]]
                key = "workload"
                operator = "In"
                values = ["ci"]
  [runners.feature_flags]
    FF_GITLAB_REGISTRY_HELPER_IMAGE = true
  [runners.cache]
    Type = "s3"
    Path = "xxxxxxxxx"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "xxxxxxxxx"
      BucketName = "xxxxxxxxx"
      BucketLocation = "xxxxxxxxx"
      Insecure = false
      AuthenticationType = "iam"
  [runners.custom_build_dir]
    enabled = true
Used GitLab Runner version
Possible fixes
n/a