
GitLab Runner Pods stuck in "Terminating" state on AWS EKS 1.23

Status update - 2024-03-20

We recently introduced the feature flag FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR, which seems to fix the issue (see thread 👉🏿 #30230 (comment 1757639559)).

The feature flag was introduced in the following milestone

Further improvements for dumb-init support are still needed and are being addressed in this issue #37381.
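
For anyone who wants to try the flag, here is a minimal sketch of enabling it per runner in config.toml. It assumes a GitLab Runner release that already ships FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR and uses the generic [runners.feature_flags] table. The fragment is an illustration, not part of the original report, and the other runner settings are omitted.

```toml
[[runners]]
  # ... existing runner settings stay unchanged ...

  # Assumption: the installed runner version knows this feature flag.
  [runners.feature_flags]
    FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR = true
```

Runner feature flags can generally also be set as a CI/CD variable with the same name and the value "true", which may be easier for testing on a single job before changing the runner configuration.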

Summary

We use GitLab runners with the Kubernetes executor. It seems that the Pod of every job the runner starts does not terminate properly but gets stuck in the "Terminating" state.

Steps to reproduce

  • run jobs on a Kubernetes cluster
  • after the job is (successfully or not) finished, view Pods on the cluster
  • notice the job's Pod stuck in "Terminating" state

Actual behavior

  • Pods stuck in "Terminating" state

Expected behavior

  • job Pods being cleaned up and vanishing

Relevant logs and/or screenshots

  • it seems that the process is no longer present at the host level (checked via ssh and docker ps)
  • yet the containers within the job Pods show "ERROR" state
  • the job result (success or failure) and the project seem to make no difference
  • happens with all GitLab runners I deployed

screenshots (images not reproduced here): terminating-overview, terminating-inner-containers

/var/log/messages for a "Terminated" container
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.823617    3693 kubelet.go:2120] "SyncLoop ADD" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946041    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946155    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946217    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946241    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946279    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946307    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047362    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047460    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047533    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047570    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047643    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047699    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047815    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048095    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048277    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.050600    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.061048    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.062031    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.159234    3693 kuberuntime_manager.go:487] "No sandbox for pod can be found. Need to start a new one" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.835414    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873037    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873611    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:98f1518e3d2f4ba7d66e1fd5238e49e928f81c5fe044e7bcbf408ec98ff7e45c}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926561    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926604    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.802812    3693 kubelet.go:2136] "SyncLoop DELETE" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.804708    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="build" containerID="docker://7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d" gracePeriod=1
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.805281    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="helper" containerID="docker://c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc" gracePeriod=1
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705701    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705852    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705876    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:37:41 ip-10-0-136-6 kubelet: I0308 09:37:41.984526    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7co    

Environment description

  • AWS EKS 1.23
  • multiple GitLab runners with different tags
config.toml contents
[[runners]]
environment = ["DOCKER_AUTH_CONFIG={\"auths\":{\"https://index.docker.io/v1/\":{\"auth\":\"{{ `{{ssm /service/prod/gitlab/DOCKER_AUTH_TOKEN eu-central-1}}` }}\"}}}"]
[runners.kubernetes]
  image = "ubuntu:22.04"

  # THIS MUST GO. USE KANIKO TO BUILD CONTAINERS.
  privileged = true

  service_account = "gitlab-jobs"

  # The job fails if the container is not ready after this many seconds.
  # Very long because of the startup time of one of our services on Fargate.
  poll_timeout = 600

  cpu_limit = "3"
  cpu_request = "500m"
  memory_limit = "8Gi"
  memory_request = "500Mi"

  helper_cpu_limit = "250m"
  helper_cpu_request = "250m"
  helper_memory_limit = "8Gi"
  helper_memory_request = "128Mi"

  service_cpu_limit = "4"
  service_cpu_request = "200m"
  service_memory_limit = "8Gi"
  service_memory_request = "128Mi"

  [runners.kubernetes.pod_labels]
    "gitlab.com/project-id" = "${CI_PROJECT_ID}"
    "gitlab.com/project-name" = "${CI_PROJECT_NAME}"
    "gitlab.com/project-path" = "${CI_PROJECT_PATH}"
    "job.runner.gitlab.com/runner-name" = "cluster"

  [runners.kubernetes.pod_annotations]
    "job.runner.gitlab.com/pipeline-url" = "${CI_PIPELINE_URL}"

[runners.cache]
  Type = "s3"
  Path = "runners-all"
  Shared = true
  [runners.cache.s3]
    ServerAddress = "s3.amazonaws.com"
    BucketName = "our-company-gitlab-runner-cache"
    BucketLocation = "eu-central-1"
    Insecure = false

[[runners.kubernetes.volumes.empty_dir]]
  name = "docker-certs"
  mount_path = "/certs/client"
  medium = "Memory"

Used GitLab Runner version

Helm chart v0.50.1; previously also v0.47.0