Jobs fail despite successful task completion when pods are evicted (since Runner 17.9)

Summary

We configure the Kubernetes executor with an increased pod_termination_grace_period_seconds so that job pods evicted during a node drain (e.g., as a result of autoscaling) have a time window to complete their tasks and terminate gracefully before they are terminated forcefully.
Since GitLab Runner 17.9.0, due to a new feature introduced in !5068 (merged), jobs of evicted pods are marked as failed even when they complete their tasks within that window.

Steps to reproduce

Use GitLab Runner version 17.9.0 or 17.9.1 with the Kubernetes executor and pod_termination_grace_period_seconds = 300 in config.template.toml (see Environment description).
Trigger 10 jobs that each sleep for 180 seconds via a new pipeline (see .gitlab-ci.yml). Use enough jobs to trigger the issue, as it does not occur every time.
Wait for the new jobs to be scheduled on a Kubernetes node, then drain this node once the pods have initialized successfully and are in the Running status:
```
kubectl drain --delete-emptydir-data --ignore-daemonsets --force [NODE_NAME]
```
Observe that the runner pods are evicted but keep running until the two script commands have been executed.
Observe that the gitlab-runner pod that spawned the runner pods detects the DisruptionTarget status condition of the runner pods (often with a delay, sometimes only at the end of the job run, and sometimes not for all pods) and aborts the job. This results in an immediate job error in the GitLab job log even though the runner pod is still running.
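
To confirm which runner pods carry the condition while they are still running, the pod status can be inspected (for example from the output of kubectl get pod [POD_NAME] -o json). The following is a minimal Go sketch of that check, not code from the runner. The type and function names (podCondition, podView, findDisruption) and the abbreviated sample JSON are assumptions made for illustration; only the condition type, reason, and message are taken from the job log in this report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podCondition mirrors one entry of a pod's status.conditions list.
type podCondition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// podView holds only the parts of the pod object that this check needs.
type podView struct {
	Status struct {
		Phase      string         `json:"phase"`
		Conditions []podCondition `json:"conditions"`
	} `json:"status"`
}

// samplePod is a hypothetical, heavily abbreviated pod object: the pod is
// still Running although it is already marked as a disruption target.
const samplePod = `{
  "status": {
    "phase": "Running",
    "conditions": [
      {"type": "Ready", "status": "True"},
      {"type": "DisruptionTarget", "status": "True",
       "reason": "EvictionByEvictionAPI", "message": "Eviction API: evicting"}
    ]
  }
}`

// findDisruption returns the DisruptionTarget condition and true if the
// condition is present with status "True"; otherwise it returns false.
func findDisruption(podJSON []byte) (podCondition, bool, error) {
	var pod podView
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return podCondition{}, false, err
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			return c, true, nil
		}
	}
	return podCondition{}, false, nil
}

func main() {
	cond, disrupted, err := findDisruption([]byte(samplePod))
	if err != nil {
		panic(err)
	}
	fmt.Printf("disrupted=%t reason=%q message=%q\n", disrupted, cond.Reason, cond.Message)
}
```

In the situation described above, such a check would report the pod as disrupted while its phase is still Running, which is the state in which the job gets aborted.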

.gitlab-ci.yml

stages:
  - wait

.wait_job: &wait_job
  stage: wait
  image: busybox:latest
  script:
    - sleep 180
    - echo "Done"

wait_jobs:
  <<: *wait_job
  parallel:
    matrix:
      - WAIT_JOB: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Actual behavior

The runner pod is evicted when the node is drained
The pod continues to run for up to pod_termination_grace_period_seconds seconds and finishes all job tasks
The gitlab-runner pod detects the pod as evicted, which results in a failed job with the following error in the job log in GitLab: ERROR: Job failed (system failure): pod "ops/runner-[REDACTED]-project-820-concurrent-1-frrqaht1" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"

Expected behavior

The runner pod is evicted when the node is drained
The pod continues to run for up to pod_termination_grace_period_seconds seconds and finishes all job tasks
The job is listed as successful in the job log in the GitLab UI

Relevant logs and/or screenshots

job log (GitLab Runner 17.9.1)

Running with gitlab-runner 17.9.1 (bbf75488)
  on gitlab-runner-7c666d584-nsx8k [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "kubernetes" executor 00:00
"CPURequest" overwritten with "10m"
"MemoryRequest" overwritten with "50Mi"
"EphemeralStorageRequest" overwritten with "1Gi"
"CPULimit" overwritten with "10m"
"MemoryLimit" overwritten with "50Mi"
"EphemeralStorageLimit" overwritten with "1Gi"
Using Kubernetes namespace: [REDACTED]
Using Kubernetes executor with image busybox:latest ...
Using attach strategy to execute scripts...
Preparing environment 00:08
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s...
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59 to be running, status is Pending
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-[REDACTED]-project-820-concurrent-0-bxlybk59 via gitlab-runner-7c666d584-nsx8k...
Getting source from Git repository 00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in [REDACTED]
Created fresh repository.
Checking out c2f5810f as detached HEAD (ref is main)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 00:02
Cleaning up project directory and file based variables 03:01
$ sleep 180
$ echo "Done"
Done
ERROR: Job failed (system failure): pod "ops/runner-[REDACTED]-project-820-concurrent-0-bxlybk59" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"

job log (GitLab Runner 17.8.3)

Running with gitlab-runner 17.8.3 (690ce25c)
  on gitlab-runner-c97dd4c75-cmbnf [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "kubernetes" executor 00:00
"CPURequest" overwritten with "10m"
"MemoryRequest" overwritten with "50Mi"
"EphemeralStorageRequest" overwritten with "1Gi"
"CPULimit" overwritten with "10m"
"MemoryLimit" overwritten with "50Mi"
"EphemeralStorageLimit" overwritten with "1Gi"
Using Kubernetes namespace: [REDACTED]
Using Kubernetes executor with image busybox:latest ...
Using attach strategy to execute scripts...
Preparing environment 00:11
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s...
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
	ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod ops/runner-[REDACTED]-project-820-concurrent-0-f6882k0j to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-[REDACTED]-project-820-concurrent-0-f6882k0j via gitlab-runner-c97dd4c75-cmbnf...
Getting source from Git repository 00:02
Fetching changes with git depth set to 20...
Initialized empty Git repository in [REDACTED]
Created fresh repository.
Checking out c2f5810f as detached HEAD (ref is main)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 03:04
$ sleep 180
$ echo "Done"
Done
Cleaning up project directory and file based variables 00:00
Job succeeded

Environment description

A self-managed GitLab instance and GitLab Runners with the Kubernetes executor (Kubernetes version 1.31) are used.

config.toml contents

shutdown_timeout = 0
concurrent = 20
check_interval = 5
log_level = "info"
log_format = "json"
listen_address = ":9252"

Kubernetes config.template.toml contents

[[runners]]
  [runners.kubernetes]
    image = "ubuntu:22.04"
    cpu_request = "50m"
    cpu_request_overwrite_max_allowed = "4"
    cpu_limit_overwrite_max_allowed = "4"
    memory_request = "500Mi"
    memory_request_overwrite_max_allowed = "8Gi"
    memory_limit = "3Gi"
    memory_limit_overwrite_max_allowed = "8Gi"
    ephemeral_storage_request = "500Mi"
    ephemeral_storage_request_overwrite_max_allowed = "20Gi"
    ephemeral_storage_limit = "5Gi"
    ephemeral_storage_limit_overwrite_max_allowed = "20Gi"
    service_cpu_request = "50m"
    service_memory_request = "500Mi"
    service_memory_limit = "3Gi"
    helper_cpu_request = "50m"
    helper_memory_request = "100Mi"
    helper_memory_limit = "100Mi"
    poll_timeout = 500
    pod_termination_grace_period_seconds = 300
    cleanup_grace_period_seconds = 1
    priority_class_name = "gitlab-runner-low"

Used GitLab Runner version

gitlab-runner --version:

Version:      17.9.1
Git revision: bbf75488
Git branch:   17-9-stable
GO version:   go1.23.2 X:cacheprog
Built:        2025-03-07T23:57:02Z
OS/Arch:      linux/amd64

First lines of the build log:

Running with gitlab-runner 17.9.1 (bbf75488)
  on gitlab-runner-6f4dfc4b76-lcg5x [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "kubernetes" executor 00:00
Using Kubernetes namespace: [REDACTED]
Using Kubernetes executor with image busybox:latest ...
Using attach strategy to execute scripts...
Preparing environment

Possible fixes

The new feature !5068 (merged) introduced the issue (see https://gitlab.com/gitlab-org/gitlab-runner/-/blame/main/executors/kubernetes/internal/watchers/pod.go#L178 for the error message being logged).

Maybe an additional check whether the pod has been terminated without an error could address this issue?
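
A rough sketch of what such a check could look like is shown below. It is not the runner's actual code and does not use the real Kubernetes client types: condition, podStatus, verdict, and classify are hypothetical stand-ins made up for illustration. It also assumes that the pod phase is a usable signal for "terminated without an error", which we have not verified against how the runner tracks job completion.

```go
package main

import "fmt"

// condition and podStatus are hypothetical, simplified stand-ins for the
// Kubernetes pod status fields; they are not the runner's real types.
type condition struct {
	Type   string // for example "DisruptionTarget"
	Status string // "True", "False" or "Unknown"
}

type podStatus struct {
	Phase      string // "Pending", "Running", "Succeeded", "Failed" or "Unknown"
	Conditions []condition
}

// verdict describes what a (hypothetical) pod watcher would do with the job.
type verdict string

const (
	notDisrupted     verdict = "not disrupted: nothing to do"
	keepWaiting      verdict = "disrupted but still running: keep waiting"
	completedCleanly verdict = "disrupted but succeeded: do not fail the job"
	failJob          verdict = "disrupted and not successful: fail the job"
)

// isDisrupted reports whether the pod has a DisruptionTarget condition
// whose status is "True".
func isDisrupted(s podStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			return true
		}
	}
	return false
}

// classify only fails the job for a disrupted pod when the pod is neither
// still running nor finished successfully.
func classify(s podStatus) verdict {
	if !isDisrupted(s) {
		return notDisrupted
	}
	switch s.Phase {
	case "Running":
		// The pod may still finish within its termination grace period.
		return keepWaiting
	case "Succeeded":
		return completedCleanly
	default:
		return failJob
	}
}

func main() {
	evictedButRunning := podStatus{
		Phase:      "Running",
		Conditions: []condition{{Type: "DisruptionTarget", Status: "True"}},
	}
	fmt.Println(classify(evictedButRunning))
}
```

The point of the sketch is only that the DisruptionTarget condition alone would not be treated as a failure: an evicted pod that is still running within its grace period would be given the chance to finish first.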