Jobs fail despite successful task completion when pods are evicted (since Runner 17.9)
Summary
- We configure the Kubernetes executor with an increased pod_termination_grace_period_seconds so that job pods evicted during a node drain (e.g., as a result of autoscaling) have a time window to complete their tasks and terminate gracefully before they are forcefully terminated
- Since GitLab Runner 17.9.0, due to a new feature introduced in !5068 (merged), jobs running in evicted pods are marked as failed even when they complete their tasks within the grace period, before the pod is terminated
Steps to reproduce
- Use GitLab Runner version 17.9.0 or 17.9.1 with the Kubernetes executor and pod_termination_grace_period_seconds = 300 in config.template.toml (see Environment description)
- Trigger 10 jobs that sleep for 180 seconds via a new pipeline (see .gitlab-ci.yml); use enough pods to trigger the issue, as it does not occur every time
- Wait for the new jobs to be scheduled on a Kubernetes node and drain this node once the pods have initialized successfully and are in the status Running: kubectl drain --delete-emptydir-data --ignore-daemonsets --force [NODE_NAME]
- Observe that the runner pods are evicted but keep running until the two script commands have been executed
- Observe that the gitlab-runner pod that spawned the runner pods detects the DisruptionTarget status condition of the runner pods (often with a delay or only at the end of the job run, and sometimes not for all pods) and aborts the job, resulting in an immediate job error in the GitLab job log even though the runner pod is still running
.gitlab-ci.yml
stages:
  - wait

.wait_job: &wait_job
  stage: wait
  image: busybox:latest
  script:
    - sleep 180
    - echo "Done"

wait_jobs:
  <<: *wait_job
  parallel:
    matrix:
      - WAIT_JOB: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Actual behavior
- The runner pod is evicted when the node is drained
- The pod continues to run within the grace period (pod_termination_grace_period_seconds) and finishes all job tasks
- The pod is detected as evicted by the gitlab-runner pod, and this results in a failed job with the following error in the job log in GitLab:
  ERROR: Job failed (system failure): pod "ops/runner-[REDACTED]-project-820-concurrent-1-frrqaht1" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"
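To make the observed behavior concrete, here is a minimal Go sketch of what the disruption check appears to do, judging only from the error message in the job log. It is not the runner's actual code: PodCondition is a simplified, hypothetical stand-in for the Kubernetes pod condition type, and checkDisruption is a name made up for illustration. The point is that the check seems to fire as soon as the DisruptionTarget condition is present, without considering whether the job can still finish within the grace period.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a Kubernetes pod
// status condition; only the fields needed for this illustration are modeled.
type PodCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// checkDisruption (hypothetical name) returns an error as soon as a
// DisruptionTarget condition with status "True" is present. It does not look
// at whether the job script is still running or has already finished.
func checkDisruption(namespace, name string, conds []PodCondition) error {
	for _, c := range conds {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			return fmt.Errorf("pod %q is disrupted: reason %q, message %q",
				namespace+"/"+name, c.Reason, c.Message)
		}
	}
	return nil
}

func main() {
	// Conditions as they might look shortly after "kubectl drain" evicts the pod.
	conds := []PodCondition{
		{Type: "Ready", Status: "True"},
		{
			Type:    "DisruptionTarget",
			Status:  "True",
			Reason:  "EvictionByEvictionAPI",
			Message: "Eviction API: evicting",
		},
	}
	fmt.Println(checkDisruption("ops", "runner-example", conds))
}
```

With the example conditions above, the returned error has the same shape as the one shown in the job log, even though nothing in the input says the job itself failed.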
Expected behavior
- The runner pod is evicted when the node is drained
- The pod continues to run within the grace period (pod_termination_grace_period_seconds) and finishes all job tasks
- The job is listed as successful in the job log in the GitLab UI
Relevant logs and/or screenshots
job log (GitLab Runner 17.9.1)
Running with gitlab-runner 17.9.1 (bbf75488)
on gitlab-runner-7c666d584-nsx8k [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "kubernetes" executor 00:00
"CPURequest" overwritten with "10m"
"MemoryRequest" overwritten with "50Mi"
"EphemeralStorageRequest" overwritten with "1Gi"
"CPULimit" overwritten with "10m"
"MemoryLimit" overwritten with "50Mi"
"EphemeralStorageLimit" overwritten with "1Gi"
Using Kubernetes namespace: [REDACTED]
Using Kubernetes executor with image busybox:latest ...
Using attach strategy to execute scripts...
Preparing environment 00:08
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s...
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59 to be running, status is Pending
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59 to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-[REDACTED]-project-820-concurrent-0-bxlybk59 via gitlab-runner-7c666d584-nsx8k...
Getting source from Git repository 00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in [REDACTED]
Created fresh repository.
Checking out c2f5810f as detached HEAD (ref is main)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 00:02
Cleaning up project directory and file based variables 03:01
$ sleep 180
$ echo "Done"
Done
ERROR: Job failed (system failure): pod "ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"
job log (GitLab Runner 17.8.3)
Running with gitlab-runner 17.8.3 (690ce25c)
on gitlab-runner-c97dd4c75-cmbnf [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "kubernetes" executor 00:00
"CPURequest" overwritten with "10m"
"MemoryRequest" overwritten with "50Mi"
"EphemeralStorageRequest" overwritten with "1Gi"
"CPULimit" overwritten with "10m"
"MemoryLimit" overwritten with "50Mi"
"EphemeralStorageLimit" overwritten with "1Gi"
Using Kubernetes namespace: [REDACTED]
Using Kubernetes executor with image busybox:latest ...
Using attach strategy to execute scripts...
Preparing environment 00:11
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s...
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-[REDACTED]-project-820-concurrent-0-f6882k0j via gitlab-runner-c97dd4c75-cmbnf...
Getting source from Git repository 00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in [REDACTED]
Created fresh repository.
Checking out c2f5810f as detached HEAD (ref is main)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 03:04
$ sleep 180
$ echo "Done"
Done
Cleaning up project directory and file based variables 00:00
Job succeeded
Environment description
A self-managed GitLab instance and GitLab Runners with the Kubernetes executor (Kubernetes version 1.31) are used.
config.toml contents
shutdown_timeout = 0
concurrent = 20
check_interval = 5
log_level = "info"
log_format = "json"
listen_address = ":9252"
Kubernetes config.template.toml contents
[[runners]]
[runners.kubernetes]
image = "ubuntu:22.04"
cpu_request = "50m"
cpu_request_overwrite_max_allowed = "4"
cpu_limit_overwrite_max_allowed = "4"
memory_request = "500Mi"
memory_request_overwrite_max_allowed = "8Gi"
memory_limit = "3Gi"
memory_limit_overwrite_max_allowed = "8Gi"
ephemeral_storage_request = "500Mi"
ephemeral_storage_request_overwrite_max_allowed = "20Gi"
ephemeral_storage_limit = "5Gi"
ephemeral_storage_limit_overwrite_max_allowed = "20Gi"
service_cpu_request = "50m"
service_memory_request = "500Mi"
service_memory_limit = "3Gi"
helper_cpu_request = "50m"
helper_memory_request = "100Mi"
helper_memory_limit = "100Mi"
poll_timeout = 500
pod_termination_grace_period_seconds = 300
cleanup_grace_period_seconds = 1
priority_class_name = "gitlab-runner-low"
Used GitLab Runner version
- gitlab-runner --version:
  Version: 17.9.1
  Git revision: bbf75488
  Git branch: 17-9-stable
  GO version: go1.23.2 X:cacheprog
  Built: 2025-03-07T23:57:02Z
  OS/Arch: linux/amd64
- First lines of the build log:
  Running with gitlab-runner 17.9.1 (bbf75488)
  on gitlab-runner-6f4dfc4b76-lcg5x [REDACTED], system ID: [REDACTED]
  Resolving secrets
  Preparing the "kubernetes" executor 00:00
  Using Kubernetes namespace: [REDACTED]
  Using Kubernetes executor with image busybox:latest ...
  Using attach strategy to execute scripts...
  Preparing environment
Possible fixes
The new feature !5068 (merged) introduced the issue (see https://gitlab.com/gitlab-org/gitlab-runner/-/blame/main/executors/kubernetes/internal/watchers/pod.go#L178 for the error message being logged).
Maybe an additional check whether the pod has been terminated without an error could address this issue?
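As a rough illustration of the kind of additional check meant here, the following Go sketch only treats a disrupted pod as a job failure once the pod has actually ended in an error state. Everything in it is an assumption for discussion: podView, action, and onPodUpdate are hypothetical names, the pod is reduced to its phase and a Disrupted flag, and the real watcher in pod.go may need to consider more state (for example, container statuses or whether the job script has already reported its result).

```go
package main

import "fmt"

// podView is a hypothetical, minimal view of a job pod: its Kubernetes phase
// ("Pending", "Running", "Succeeded", "Failed", "Unknown") and whether a
// DisruptionTarget condition with status "True" has been observed.
type podView struct {
	Phase     string
	Disrupted bool
}

// action is what the watcher would do after a pod update (hypothetical type).
type action string

const (
	keepWatching action = "keep watching"
	failJob      action = "fail job"
)

// onPodUpdate (hypothetical) fails the job only when a disrupted pod has
// ended in the Failed phase. A disrupted pod that is still Running is left
// alone so the job can finish within the termination grace period, and a
// disrupted pod that reached Succeeded is not treated as an error.
func onPodUpdate(p podView) action {
	if p.Disrupted && p.Phase == "Failed" {
		return failJob
	}
	return keepWatching
}

func main() {
	updates := []podView{
		{Phase: "Running", Disrupted: false},  // normal operation
		{Phase: "Running", Disrupted: true},   // evicted, still inside the grace period
		{Phase: "Succeeded", Disrupted: true}, // evicted, but terminated without error
		{Phase: "Failed", Disrupted: true},    // evicted and terminated with an error
	}
	for _, u := range updates {
		fmt.Printf("phase=%s disrupted=%t -> %s\n", u.Phase, u.Disrupted, onPodUpdate(u))
	}
}
```

One trade-off of this approach: a job that really is killed at the end of the grace period would be reported later than today, namely only once the pod reaches a failed state, rather than at the moment of eviction.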