GitLab Kubernetes runner changes the job status from runner_system_failure to script_failure when the job pod is unexpectedly terminated

Summary

In earlier versions of GitLab, we relied on retry logic based on the job failure status (runner_system_failure) for cases where the job executor (the Kubernetes pod) was terminated unexpectedly. This no longer works as expected in current versions.

Steps to reproduce

Configure a job that runs on a Kubernetes runner and set its retry logic to when: runner_system_failure. If the job pod is then terminated unexpectedly (for example, the underlying node is removed by the provider, as happens with AWS Spot Instances), the job is not retried.

.gitlab-ci.yml
stages:
  - test

Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - some_script 
  tags:
    - kubernetes # running the job using kubernetes runner

Actual behavior

The job is not retried. Hovering over the failed job shows a script failure even though the pod was deleted, which indicates that the failure reason was recorded as script_failure instead of runner_system_failure.

Expected behavior

The job should be retried, as configured by when: runner_system_failure.
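
To make the mismatch concrete, the retry decision can be modeled roughly as below. This is a simplified sketch, not GitLab's actual implementation: the function should_retry and its parameters are hypothetical, and it assumes a job is retried only when attempts remain and the reported failure reason matches one of the retry:when values.

```python
from typing import Iterable


def should_retry(
    failure_reason: str,
    retry_when: Iterable[str],
    retries_so_far: int,
    max_retries: int,
) -> bool:
    """Hypothetical model of the retry decision (an assumption, not GitLab code).

    A retry is allowed only if the retry budget is not exhausted and the
    reported failure reason matches one of the configured `when` values.
    """
    if retries_so_far >= max_retries:
        return False
    conditions = set(retry_when)
    # "always" is the catch-all value of retry:when.
    return "always" in conditions or failure_reason in conditions


# Settings from the .gitlab-ci.yml in this report: max: 2, when: runner_system_failure
configured_when = ["runner_system_failure"]

# Expected: a deleted pod is reported as runner_system_failure, so the job is retried.
print(should_retry("runner_system_failure", configured_when, 0, 2))  # True

# Observed: the same event is reported as script_failure, so no retry happens.
print(should_retry("script_failure", configured_when, 0, 2))  # False
```

Under this model, the retry rule itself behaves correctly; the job is skipped only because the failure reason reported for the deleted pod does not match the configured condition.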

Relevant logs and/or screenshot

job log
Running with gitlab-runner 13.3.0 (86ad88ea)
  on gitlab-runner-gitlab-runner-864df88976-pv7rt mB4fmPhH
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab-runner
Using Kubernetes executor with image artifactory.a-us-common.wfk8s.com/docker/globalqa/pipeline-trigger:latest ...
Using attach strategy to execute scripts...
Preparing environment
01:16
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Waiting for pod gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489 to be running, status is Pending
Running on runner-mb4fmphh-project-3560-concurrent-5xb489 via gitlab-runner-gitlab-runner-864df88976-pv7rt...
Getting source from Git repository
00:16
Fetching changes with git depth set to 10...
Initialized empty Git repository in /builds/product/web-application/.git/
Created fresh repository.
Checking out 3a15ec09 as refs/merge-requests/15072/merge...
Skipping Git submodules setup
Executing "step_script" stage of the job script
05:45
$ ./ci/create-storm.sh
Redrock Repo is: product/web-application
Redrock branch is: marvel-test-initiative-clint
Storm will be created using this repo, branch and pitboss env for Redrock: product/web-application, marvel-test-initiative-clint, od  and this repo and branch for QS: product/quicksilver, master
2020-09-01 16:50:07,547 - 12 - [INFO] Redrock Repo is: product/web-application
2020-09-01 16:50:07,547 - 12 - [INFO] Redrock Branch is: marvel-test-initiative-clint
2020-09-01 16:50:07,547 - 12 - [INFO] Creating Storm in FastCI
2020-09-01 16:50:07,547 - 12 - [INFO] This storm will be created in fastci area with name GitLab-Redrock-Storm-701339.
2020-09-01 16:50:07,815 - 12 - [INFO] Storm instance has been created. Redrock repo: product/web-application , redrock branch: marvel-test-initiative-clint, pitboss environment: od 
2020-09-01 16:50:07,816 - 12 - [INFO] Storm Environment Link: https://ace.workfront.tech/devops/#/StormV3/fastci/5f4e7bbf463cf6000f61225e
2020-09-01 16:50:07,816 - 12 - [INFO] Waiting for Storm with id 5f4e7bbf463cf6000f61225e to get Ready...
2020-09-01 16:50:07,925 - 12 - [INFO] Sleeping 120 seconds for initial environment creation: 
2020-09-01 16:52:23,254 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:52:38,409 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:52:53,670 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:08,876 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:24,035 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:39,250 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:53:54,435 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:09,597 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:24,759 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:39,957 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:54:55,120 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:10,276 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:25,530 - 12 - [INFO] Storm environment status is: BUSY
2020-09-01 16:55:40,742 - 12 - [INFO] Storm environment status is: BUSY
ERROR: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found
runner manager log
Appending trace to coordinator... ok                code=202 job=3306281 job-log=0-6137 job-status=running runner=mB4fmPhH sent-log=5993-6136 status=202 Accepted update-interval=30s
Feeding runners to channel                          builds=9
Checking for jobs... nothing                        runner=mB4fmPhH
Feeding runners to channel                          builds=9
Checking for jobs... nothing                        runner=mB4fmPhH
Feeding runners to channel                          builds=9
Checking for jobs... nothing                        runner=mB4fmPhH
Executing build stage                               build_stage=after_script job=3306281 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do)                      build_stage=after_script job=3306281 project=3560 runner=mB4fmPhH
Executing build stage                               build_stage=upload_artifacts_on_failure job=3306281 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do)                      build_stage=upload_artifacts_on_failure job=3306281 project=3560 runner=mB4fmPhH
Skipping referees execution                         job=3306281 project=3560 runner=mB4fmPhH
WARNING: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found  duration=7m17.095383378s job=3306281 project=3560 runner=mB4fmPhH
Appending trace to coordinator... ok                code=202 job=3306281 job-log=0-6343 job-status=running runner=mB4fmPhH sent-log=6137-6342 status=202 Accepted update-interval=30s
Submitting job to coordinator... ok                 code=200 job=3306281 job-status= runner=mB4fmPhH
ERROR: Error cleaning up pod: pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found  job=3306281 project=3560 runner=mB4fmPhH
WARNING: Failed to process runner                   builds=8 error=pods "runner-mb4fmphh-project-3560-concurrent-5xb489" not found executor=kubernetes runner=mB4fmPhH
WARNING: Error streaming logs gitlab-runner/runner-mb4fmphh-project-3560-concurrent-4wdb4w/helper:/builds/product/web-application.tmp/logs/output.log: command terminated with exit code 137. Retrying...  job=3306279 project=3560 runner=mB4fmPhH
Backing off reattaching log for gitlab-runner/runner-mb4fmphh-project-3560-concurrent-4wdb4w/helper:/builds/product/web-application.tmp/logs/output.log for 2s  job=3306279 project=3560 runner=mB4fmPhH
Backing off reattaching log for gitlab-runner/runner-mb4fmphh-project-3560-concurrent-5xb489/helper:/builds/product/web-application.tmp/logs/output.log for 2s  job=3306281 project=3560 runner=mB4fmPhH
Detaching from log... context canceled              job=3306281 project=3560 runner=mB4fmPhH
Executing build stage                               build_stage=after_script job=3306279 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do)                      build_stage=after_script job=3306279 project=3560 runner=mB4fmPhH
Executing build stage                               build_stage=upload_artifacts_on_failure job=3306279 project=3560 runner=mB4fmPhH
Skipping stage (nothing to do)                      build_stage=upload_artifacts_on_failure job=3306279 project=3560 runner=mB4fmPhH
Skipping referees execution                         job=3306279 project=3560 runner=mB4fmPhH
WARNING: Job failed: pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found  duration=7m18.311197726s job=3306279 project=3560 runner=mB4fmPhH
Detaching from log... context canceled              job=3306279 project=3560 runner=mB4fmPhH
Appending trace to coordinator... ok                code=202 job=3306279 job-log=0-5115 job-status=running runner=mB4fmPhH sent-log=4981-5114 status=202 Accepted update-interval=30s
Submitting job to coordinator... ok                 code=200 job=3306279 job-status= runner=mB4fmPhH
ERROR: Error cleaning up pod: pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found  job=3306279 project=3560 runner=mB4fmPhH
WARNING: Failed to process runner                   builds=7 error=pods "runner-mb4fmphh-project-3560-concurrent-4wdb4w" not found executor=kubernetes runner=mB4fmPhH

Environment description

We run our own on-premises deployment of both GitLab and GitLab Runner. GitLab Runner (v13.3.0) is deployed with the official Helm chart provided by GitLab.

config.toml contents
listen_address = ":9252"
concurrent = 600
check_interval = 5
log_level = "debug"
sentry_dsn = "https://sentry_url"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-runner-gitlab-runner-864df88976-fb97n"
  output_limit = 15000
  request_concurrency = 1
  url = "https://gitlab_url/"
  token = "token"
  executor = "kubernetes"
  environment = [some_vars]
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Path = "gitlab_runner"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      AccessKey = "key"
      SecretKey = "key"
      BucketName = "bucket_name"
      BucketLocation = "us-west-2"
    [runners.cache.gcs]
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "docker:dind"
    namespace = "gitlab-runner"
    namespace_overwrite_allowed = ""
    privileged = true
    cpu_limit = "400m"
    cpu_limit_overwrite_max_allowed = "6"
    memory_limit = "2Gi"
    memory_limit_overwrite_max_allowed = "16Gi"
    service_cpu_limit = "200m"
    service_memory_limit = "512Mi"
    helper_cpu_limit = "1"
    helper_memory_limit = "3Gi"
    cpu_request = "300m"
    cpu_request_overwrite_max_allowed = "3"
    memory_request = "2Gi"
    memory_request_overwrite_max_allowed = "16Gi"
    service_cpu_request = "100m"
    service_memory_request = "512Mi"
    helper_cpu_request = "500m"
    helper_memory_request = "2Gi"
    pull_policy = "always"
    poll_timeout = 540
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
      application = "gitlab-runner"
    [runners.kubernetes.node_tolerations]
      "role=spot-instance" = "NoSchedule"
    [runners.kubernetes.pod_annotations]
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.volumes]
      [[runners.kubernetes.volumes.host_path]]
        host_path = "/var/run/docker.sock"
        mount_path = "/var/run/docker.sock"
        name = "docker"
        read_only = false

      [[runners.kubernetes.volumes.config_map]]
        mount_path = "/root/.docker"
        name = "docker-config"
        read_only = false
        [runners.kubernetes.volumes.config_map.items]
          "config.json" = "config.json"

      [[runners.kubernetes.volumes.config_map]]
        mount_path = "/root/.m2"
        name = "mvn-config"
        read_only = false
        [runners.kubernetes.volumes.config_map.items]
          "settings.xml" = "settings.xml"

      [[runners.kubernetes.volumes.config_map]]
        mount_path = "/root/etc/"
        name = "npm-config"
        read_only = false
        [runners.kubernetes.volumes.config_map.items]
          npmrc = "npmrc"

      [[runners.kubernetes.volumes.pvc]]
        mount_path = "/root/ssh"
        name = "gitlab-runner"
        read_only = false

Used GitLab Runner version

gitlab-runner --version
Version:      13.3.0
Git revision: 86ad88ea
Git branch:   13-3-stable
GO version:   go1.13.8
Built:        2020-08-20T06:29:34+0000
OS/Arch:      linux/amd64