GitLab Kubernetes runner reports script_failure instead of runner_system_failure when the job pod is unexpectedly terminated
Summary
In earlier versions of GitLab, our retry logic based on the job failure status (runner_system_failure) worked when the job executor (a Kubernetes pod) was terminated unexpectedly. It no longer works as expected in current versions.
Steps to reproduce
Configure a job that runs on a Kubernetes runner and set its retry logic to when: runner_system_failure.
If the job pod is then terminated unexpectedly (for example, the underlying node is removed by the provider, as happens with AWS Spot Instances), the job is not retried.
.gitlab-ci.yml
stages:
  - test
Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - some_script
  tags:
    - kubernetes # running the job using kubernetes runner
Actual behavior
The job is not retried after the pod has been deleted, although this case should count as runner_system_failure.
Hovering over the failed job shows "script failure",
which indicates that the recorded failure reason was script_failure
instead of runner_system_failure.
Expected behavior
The job should be retried, as configured by when: runner_system_failure, whenever the job pod is terminated unexpectedly.
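To make the mismatch concrete, here is a minimal sketch of how a retry decision based on the failure reason could work. It is a simplified model for illustration only, not GitLab's actual implementation; the names RetryPolicy and should_retry are hypothetical. The failure reason strings are the ones from this report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RetryPolicy:
    """Hypothetical model of the `retry` block from .gitlab-ci.yml."""
    max: int
    when: Sequence[str]


def should_retry(policy: RetryPolicy, failure_reason: str, retries_so_far: int) -> bool:
    """Return True if a failed job would be retried under this simplified model."""
    # No retries are left once the configured maximum has been reached.
    if retries_so_far >= policy.max:
        return False
    # Retry only when the reported failure reason matches the `when` list.
    return "always" in policy.when or failure_reason in policy.when


# The policy from the .gitlab-ci.yml in this report.
policy = RetryPolicy(max=2, when=["runner_system_failure"])

# Expected: a deleted pod is reported as runner_system_failure and is retried.
print(should_retry(policy, "runner_system_failure", 0))

# Observed: the same event is reported as script_failure and is not retried.
print(should_retry(policy, "script_failure", 0))
```

Under this model the retry configuration itself is fine; the job is skipped only because the failure reason reported for the deleted pod does not match the when condition.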
Relevant logs and/or screenshot
job log
Running with gitlab-runner 13.3.0 (86ad88ea)
on gitlab-runner-gitlab-runner-864df88976-pv7rt mB4fmPhH
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab-runner
Using Kubernetes executor with image artifactory.a-us-common.wfk8s.com/docker/globalqa/pipeline-trigger:latest ...
Using attach strategy to execute scripts...
Preparing environment
01:16
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Running on runner-mb4fmphh-project-3560-concurrent-5xb489 via gitlab-runner-gitlab-runner-864df88976-pv7rt...
Getting source from Git repository
00:16
Fetching changes with git depth set to 10...
Initialized empty Git repository in /builds/product/web-application/.git/
Created fresh repository.
Checking out 3a15ec09 as refs/merge-requests/15072/merge...
Skipping Git submodules setup
Executing "step_script" stage of the job script
05:45
$ ./ci/create-storm.sh
Redrock Repo is: product/web-application
Redrock branch is: marvel-test-initiative-clint
Storm will be created using this repo, branch and pitboss env for Redrock: product/web-application, marvel-test-initiative-clint, od and this repo and branch for QS: product/quicksilver, master
2020-09-01 16:50:07,547 - 12 - [INFO] Redrock Repo is: product/web-application
2020-09-01 16:50:07,547 - 12 - [INFO] Redrock Branch is: marvel-test-initiative-clint
2020-09-01 16:50:07,547 - 12 - [INFO] Creating Storm in FastCI
2020-09-01 16:50:07,547 - 12 - [INFO] This storm will be created in fastci area with name GitLab-Redrock-Storm-701339.
2020-09-01 16:50:07,815 - 12 - [INFO] Storm instance has been created. Redrock repo: product/web-application , redrock branch: marvel-test-initiative-clint, pitboss environment: od
2020-09-01 16:50:07,816 - 12 - [INFO] Storm Environment Link: https://ace.workfront.tech/devops/#/StormV3/fastci/5f4e7bbf463cf6000f61225e
2020-09-01 16:50:07,816 - 12 - [INFO] Waiting for Storm with id 5f4e7bbf463cf6000f61225e to get Ready...
2020-09-01 16:50:07,925 - 12 - [INFO] Sleeping 120 seconds for initial environment creation:
2020-09-01 16:52:23,254 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:52:38,409 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:52:53,670 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:08,876 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:24,035 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:39,250 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:54,435 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:09,597 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:24,759 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:39,957 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:55,120 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:10,276 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:25,530 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:40,742 - 12 - [INFO] Storm environment status is: BUSY
ERROR: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found
runner manager log
Appending trace to coordinator... ok code=202 job=3306281 job-log=0-6137 job-status=running runner=mB4fmPhH sent-log=5993-6136 status=202 Accepted update-interval=30s
Feeding runners to channel builds=9
Checking for jobs... nothing runner=mB4fmPhH
Feeding runners to channel builds=9
Checking for jobs... nothing runner=mB4fmPhH
Feeding runners to channel builds=9
Checking for jobs... nothing runner=mB4fmPhH
Executing build stage build_stage=after_script job=3306281 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do) build_stage=after_script job=3306281 project=3560 runner=mB4fmPhH
Executing build stage build_stage=upload_artifacts_on_failure job=3306281 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do) build_stage=upload_artifacts_on_failure job=3306281 project=3560 runner=mB4fmPhH
Skipping referees execution job=3306281 project=3560 runner=mB4fmPhH
WARNING: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found duration=7m17.095383378s job=3306281 project=3560 runner=mB4fmPhH
Appending trace to coordinator... ok code=202 job=3306281 job-log=0-6343 job-status=running runner=mB4fmPhH sent-log=6137-6342 status=202 Accepted update-interval=30s
Submitting job to coordinator... ok code=200 job=3306281 job-status= runner=mB4fmPhH
ERROR: Error cleaning up pod: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found job=3306281 project=3560 runner=mB4fmPhH
WARNING: Failed to process runner builds=8 error=pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found executor=kubernetes runner=mB4fmPhH
WARNING: Error streaming logs gitlab-runner/runner-mb4fmphh-project-3560-concurrent-4wdb4w/helper:/builds/product/web-application.tmp/logs/output.log: command terminated with exit code 137. Retrying... job=3306279 project=3560 runner=mB4fmPhH
Backing off reattaching log for gitlab-runner/runner-mb4fmphh-project-3560-concurrent-4wdb4w/helper:/builds/product/web-application.tmp/logs/output.log for 2s job=3306279 project=3560 runner=mB4fmPhH
Backing off reattaching log for gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489/helper:/builds/product/web-application.tmp/logs/output.log for 2s job=3306281 project=3560 runner=mB4fmPhH
Detaching from log... context canceled job=3306281 project=3560 runner=mB4fmPhH
Executing build stage build_stage=after_script job=3306279 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do) build_stage=after_script job=3306279 project=3560 runner=mB4fmPhH
Executing build stage build_stage=upload_artifacts_on_failure job=3306279 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do) build_stage=upload_artifacts_on_failure job=3306279 project=3560 runner=mB4fmPhH
Skipping referees execution job=3306279 project=3560 runner=mB4fmPhH
WARNING: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found duration=7m18.311197726s job=3306279 project=3560 runner=mB4fmPhH
Detaching from log... context canceled job=3306279 project=3560 runner=mB4fmPhH
Appending trace to coordinator... ok code=202 job=3306279 job-log=0-5115 job-status=running runner=mB4fmPhH sent-log=4981-5114 status=202 Accepted update-interval=30s
Submitting job to coordinator... ok code=200 job=3306279 job-status= runner=mB4fmPhH
ERROR: Error cleaning up pod: pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found job=3306279 project=3560 runner=mB4fmPhH
WARNING: Failed to process runner builds=7 error=pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found executor=kubernetes runner=mB4fmPhH
Environment description
We run our own on-premises deployment of both GitLab and GitLab Runner. The runners are deployed with the official Helm chart provided by GitLab (GitLab Runner v13.3.0).
config.toml contents
listen_address = ":9252"
concurrent = 600
check_interval = 5
log_level = "debug"
sentry_dsn = "https://sentry_url"
[session_server]
session_timeout = 1800
[[runners]]
name = "gitlab-runner-gitlab-runner-864df88976-fb97n"
output_limit = 15000
request_concurrency = 1
url = "https://gitlab_url/"
token = "token"
executor = "kubernetes"
environment = [some_vars]
[runners.custom_build_dir]
[runners.cache]
Type = "s3"
Path = "gitlab_runner"
Shared = true
[runners.cache.s3]
ServerAddress = "s3.amazonaws.com"
AccessKey = "key"
SecretKey = "key"
BucketName = "bucket_name"
BucketLocation = "us-west-2"
[runners.cache.gcs]
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = "docker:dind"
namespace = "gitlab-runner"
namespace_overwrite_allowed = ""
privileged = true
cpu_limit = "400m"
cpu_limit_overwrite_max_allowed = "6"
memory_limit = "2Gi"
memory_limit_overwrite_max_allowed = "16Gi"
service_cpu_limit = "200m"
service_memory_limit = "512Mi"
helper_cpu_limit = "1"
helper_memory_limit = "3Gi"
cpu_request = "300m"
cpu_request_overwrite_max_allowed = "3"
memory_request = "2Gi"
memory_request_overwrite_max_allowed = "16Gi"
service_cpu_request = "100m"
service_memory_request = "512Mi"
helper_cpu_request = "500m"
helper_memory_request = "2Gi"
pull_policy = "always"
poll_timeout = 540
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
[runners.kubernetes.node_selector]
application = "gitlab-runner"
[runners.kubernetes.node_tolerations]
"role=spot-instance" = "NoSchedule"
[runners.kubernetes.pod_annotations]
"cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
[runners.kubernetes.pod_security_context]
[runners.kubernetes.volumes]
[[runners.kubernetes.volumes.host_path]]
host_path = "/var/run/docker.sock"
mount_path = "/var/run/docker.sock"
name = "docker"
read_only = false
[[runners.kubernetes.volumes.config_map]]
mount_path = "/root/.docker"
name = "docker-config"
read_only = false
[runners.kubernetes.volumes.config_map.items]
"config.json" = "config.json"
[[runners.kubernetes.volumes.config_map]]
mount_path = "/root/.m2"
name = "mvn-config"
read_only = false
[runners.kubernetes.volumes.config_map.items]
"settings.xml" = "settings.xml"
[[runners.kubernetes.volumes.config_map]]
mount_path = "/root/etc/"
name = "npm-config"
read_only = false
[runners.kubernetes.volumes.config_map.items]
npmrc = "npmrc"
[[runners.kubernetes.volumes.pvc]]
mount_path = "/root/ssh"
name = "gitlab-runner"
read_only = false
Used GitLab Runner version
gitlab-runner --version
Version: 13.3.0
Git revision: 86ad88ea
Git branch: 13-3-stable
GO version: go1.13.8
Built: 2020-08-20T06:29:34+0000
OS/Arch: linux/amd64