
Kubernetes executor pods hanging when OOMKilled with cgroup v2 enabled

Summary

When using cgroup v2 on Kubernetes (e.g. AWS EKS with Amazon Linux 2023), GitLab Runner does not detect that an executor pod has been OOMKilled and does not terminate it. Instead, the job hangs until the job timeout is reached and is only terminated then, blocking cluster resources for the whole hanging period.

Steps to reproduce

  • Deploy GitLab Runner to Kubernetes with cgroup v2 enabled (e.g. AWS EKS with Amazon Linux 2023).
  • Run a job that consumes more memory than the configured memory limit; see the sketch after this list.
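
For illustration, a job script along these lines should exceed the 8192Mi memory_limit configured below and trigger the OOM kill (a minimal sketch; python3 in the job image is an assumption, any memory-hungry command works just as well):

  # Allocate roughly 10 GiB inside the build container, well above the 8192Mi limit.
  # python3 is only an assumption about the job image; replace with any memory hog.
  python3 -c 'buf = bytearray(10 * 1024**3); print(len(buf))'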

Actual behavior

The executor pod is OOMKilled but is not cleaned up and terminated by GitLab Runner until the job timeout is reached.
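
While the job hangs, the OOM kill is already visible on the pod itself, for example (namespace and pod name below are placeholders):

  # Inspect the build pod directly; namespace and pod name are placeholders.
  kubectl describe pod runner-xxxxx-project-xxxxx -n gitlab-runner
  # The container statuses and/or pod events typically report the OOM kill
  # (Reason: OOMKilled) while the Runner still waits for the job to finish.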

Expected behavior

The executor pod is cleaned up and terminated by GitLab Runner shortly after it is OOMKilled, and the job is marked as failed due to OOM.

Relevant logs and/or screenshots

n/a

Environment description

Runner deployed to a Kubernetes cluster running on AWS EKS 1.30 with Amazon Linux 2023 nodes using cgroup v2.
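
That the nodes actually run cgroup v2 can be verified with the following (run on a node or from a debug pod):

  # Prints "cgroup2fs" on cgroup v2 nodes (the Amazon Linux 2023 default), "tmpfs" on cgroup v1.
  stat -fc %T /sys/fs/cgroup/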

config.toml contents
[[runners]]
  clone_url = "xxxxxxxxx"
  output_limit = 16384
  [runners.kubernetes]
    image = "xxxxxxxxx"
    helper_image = "xxxxxxxxx"
    image_pull_secrets = ["xxxxxxxxx"]
    poll_timeout = 600
    cpu_request = "1000m"
    cpu_request_overwrite_max_allowed = "2000m"
    memory_request = "1024Mi"
    memory_request_overwrite_max_allowed = "8192Mi"
    memory_limit = "8192Mi"
    helper_cpu_request = "50m"
    helper_memory_request = "200Mi"
    helper_memory_limit = "2048Mi"
    service_cpu_request = "500m"
    service_cpu_request_overwrite_max_allowed = "2000m"
    service_memory_request = "512Mi"
    service_memory_request_overwrite_max_allowed = "8192Mi"
    service_memory_limit = "8192Mi"
    [runners.kubernetes.pod_labels]
      "workload" = "ci"
    [runners.kubernetes.pod_annotations]
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    [runners.kubernetes.node_selector]
      "workload" = "ci"
      "arch" = "amd64"
    [runners.kubernetes.node_tolerations]
      "workload=ci" = "NoSchedule"
    [runners.kubernetes.affinity]
      [runners.kubernetes.affinity.pod_affinity]
        [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution]]
        weight = 100
        [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term]
          topology_key = "kubernetes.io/hostname"
          [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector]
            [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector.match_expressions]]
              key = "workload"
              operator = "In"
              values = ["ci"]
    [runners.feature_flags]
      FF_GITLAB_REGISTRY_HELPER_IMAGE = true
    [runners.cache]
      Type = "s3"
      Path = "xxxxxxxxx"
      Shared = true
      [runners.cache.s3]
        ServerAddress = "xxxxxxxxx"
        BucketName = "xxxxxxxxx"
        BucketLocation = "xxxxxxxxx"
        Insecure = false
        AuthenticationType = "iam"
    [runners.custom_build_dir]
      enabled = true

Used GitLab Runner version

Possible fixes

n/a
