GitLab Kubernetes runner reports script_failure instead of runner_system_failure on AWS Spot instances
Summary
This appears to be a regression of !2444 (merged) (the issue for that MR is #26856 (closed)). A GitLab.com customer who runs their Kubernetes job pods on AWS Spot instances reported that when a job pod is evicted while the job is running, the job is marked as a script failure instead of a runner system failure.
Sometimes this shows up in the job log as an issue connecting to the Docker daemon:
ERROR: Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
Other times, it's more obvious:
ERROR: Error cleaning up pod: pods "runner-qn7qyr8ex-project-40783171-concurrent-7-3ud21hdh" not found
In every case, the job's final error message is:
ERROR: Job failed: command terminated with exit code 1
#28100 (closed) seems to be essentially the same issue, just on GCP preemptible nodes instead of AWS Spot instances.
Steps to reproduce
Run a Kubernetes cluster on AWS Spot instances with retry:when set to runner_system_failure in the CI config. If the job pod is terminated unexpectedly (for example, by a Spot reclaim), the job is not retried, because the failure is classified as script_failure rather than runner_system_failure. A sketch for triggering the eviction manually follows the CI config below.
.gitlab-ci.yml
stages:
  - test

Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - sleep 7200
  tags:
    - kubernetes # running the job using kubernetes runner
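A real Spot interruption is not needed to trigger the eviction; draining the node that hosts the job pod has the same effect. A minimal sketch, assuming kubectl access to the cluster (namespace, pod, and node names are placeholders):

  # Find the node the running job pod was scheduled on
  kubectl get pods -n <namespace> -o wide | grep runner-
  # Evict everything on that node, mimicking a Spot reclaim
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Note that kubectl drain goes through the Kubernetes eviction API, so the karpenter.sh/do-not-evict annotation (honored only by Karpenter itself) does not prevent it.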
Actual behavior
When a pod eviction causes a job to fail, the failure reason is marked as script_failure.
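One way to confirm the recorded reason, assuming API access and placeholder token/IDs: the Jobs API response for a failed job includes a failure_reason field.

  curl --header "PRIVATE-TOKEN: <your-token>" \
    "https://gitlab.example.com/api/v4/projects/<project-id>/jobs/<job-id>"
  # For the affected jobs the response contains "failure_reason": "script_failure"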
Expected behavior
When a pod eviction causes a job to fail, the failure reason is marked as runner_system_failure.
Relevant logs and/or screenshots
job log
(Representative error messages are quoted in the Summary above; no full log was attached.)
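For anyone reproducing this, the eviction itself can be confirmed from the pod events (namespace and pod name are placeholders); this complements the FF_PRINT_POD_EVENTS output already enabled in the runner config below.

  kubectl get events -n <namespace> \
    --field-selector involvedObject.name=<pod-name>
  # Expect Evicted/Killing events at roughly the time the job failed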
Environment description
config.toml contents
[[runners]]
  output_limit = 16834
  [runners.kubernetes]
    namespace = "<namespace>"
    image = "ubuntu:22.04"
    privileged = true
    poll_timeout = 3600
    cpu_request = "2"
    cpu_limit = "8"
    memory_request = "16Gi"
    memory_limit = "16Gi"
    helper_cpu_request = "0.5"
    service_cpu_request = "1"
    cleanup_grace_period_seconds = 60
    pod_termination_grace_period_seconds = 120
    [runners.kubernetes.container_lifecycle.post_start.exec]
      command = ["mkdir", "-p", "/builds/tmp"]
    [runners.kubernetes.pod_annotations]
      "karpenter.sh/do-not-evict" = "true"
    [runners.kubernetes.node_selector]
      "kubernetes.io/arch" = "amd64"
      "kubernetes.io/os" = "linux"
      "karpenter.sh/nodepool" = "<nodepool-name>"
    [[runners.kubernetes.pod_spec]]
      name = "<pod-name>"
      patch_type = "strategic"
      patch = '''
        containers:
        - name: build
          volumeMounts:
          - name: repo
            mountPath: /builds
        - name: helper
          volumeMounts:
          - name: repo
            mountPath: /builds
        volumes:
        - name: repo
          emptyDir: null
          ephemeral:
            volumeClaimTemplate:
              spec:
                storageClassName: gp3
                accessModes: [ ReadWriteOnce ]
                csi:
                  fsType: xfs
                resources:
                  requests:
                    storage: 384Gi
      '''
  [runners.cache]
    Type = "s3"
    Path = "<path>"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "<bucket-name>"
      BucketLocation = "<location>"
  [runners.feature_flags]
    FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
    FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
    FF_PRINT_POD_EVENTS = true
    FF_USE_FASTZIP = true
Used GitLab Runner version
17.0.0 (44feccdf), Kubernetes executor