
GitLab Kubernetes runner reports script_failure instead of runner_system_failure on AWS Spot instances

Summary

This appears to be a regression of !2444 (merged) (the issue for that MR is #26856 (closed)). A GitLab.com customer who runs their Kubernetes job pods on AWS Spot instances reported that when a job pod is evicted while the job is running, the job is marked as a script failure instead of a runner system failure.

Sometimes this shows up in the job log as an issue connecting to the Docker daemon:

ERROR: Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?

Other times, it's more obvious:

ERROR: Error cleaning up pod: pods "runner-qn7qyr8ex-project-40783171-concurrent-7-3ud21hdh" not found

The jobs' final error message is always:

ERROR: Job failed: command terminated with exit code 1

#28100 (closed) seems to be essentially the same issue, just on GCP preemptible nodes instead of AWS Spot instances.

Zendesk ticket (internal)

Steps to reproduce

Run a Kubernetes cluster on AWS Spot instances and set retry:when to runner_system_failure in the CI configuration (see the .gitlab-ci.yml below). When the job pod is terminated unexpectedly, the job is not retried.

.gitlab-ci.yml
stages:
  - test

Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - sleep 7200
  tags:
    - kubernetes # run the job with the Kubernetes runner
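
To reproduce without waiting for a real Spot interruption, the same failure can presumably be triggered by removing the job pod, or the node it runs on, while the job is sleeping, for example:

kubectl delete pod <job-pod-name> --namespace <namespace>

or by draining the node the pod is scheduled on:

kubectl drain <node-name> --ignore-daemonsets

In either case the job should be retried under runner_system_failure, but it is instead marked as failed with script_failure. (The pod, node, and namespace names above are placeholders.)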

Actual behavior

When a pod eviction causes a job to fail, the failure reason is marked as script_failure.

Expected behavior

When a pod eviction causes a job to fail, the failure reason is marked as runner_system_failure.

Relevant logs and/or screenshots

job log
Add the job log

Environment description

config.toml contents
[[runners]]
  output_limit = 16834
  [runners.kubernetes]
    namespace = "<namespace>"
    image = "ubuntu:22.04"
    privileged = true
    poll_timeout = 3600
    cpu_request = "2"
    cpu_limit = "8"
    memory_request = "16Gi"
    memory_limit = "16Gi"
    helper_cpu_request = "0.5"
    service_cpu_request = "1"
    cleanup_grace_period_seconds = 60
    pod_termination_grace_period_seconds = 120
    [runners.kubernetes.container_lifecycle.post_start.exec]
      command = ["mkdir", "-p", "/builds/tmp"]
    [runners.kubernetes.pod_annotations]
      "karpenter.sh/do-not-evict" = "true"
    [runners.kubernetes.node_selector]
      "kubernetes.io/arch" = "amd64"
      "kubernetes.io/os" = "linux"
      "karpenter.sh/nodepool" = "<nodepool-name>"
    [[runners.kubernetes.pod_spec]]
      name = "<pod-name>"
      patch_type = "strategic"
      patch = '''
        containers:
          - name: build
            volumeMounts:
              - name: repo
                mountPath: /builds
          - name: helper
            volumeMounts:
              - name: repo
                mountPath: /builds
        volumes:
          - name: repo
            emptyDir: null
            ephemeral:
              volumeClaimTemplate:
                spec:
                  storageClassName: gp3
                  accessModes: [ ReadWriteOnce ]
                  csi:
                    fsType: xfs
                  resources:
                    requests:
                      storage: 384Gi
      '''
  [runners.cache]
    Type = "s3"
    Path = "<path>"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "<bucket-name>"
      BucketLocation = "<location>"
  [runners.feature_flags]
    FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
    FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
    FF_PRINT_POD_EVENTS = true
    FF_USE_FASTZIP = true

Used GitLab Runner version

17.0.0 (44feccdf), kubernetes executor

Possible fixes
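
Since !2444 (merged) appears to be what originally mapped pod deletion/eviction to runner_system_failure, the first step is probably to find where that mapping was lost in the kubernetes executor. As a rough sketch of the general idea only (illustrative Go against client-go, not the runner's actual code paths; the helper name podGone and its signature are made up): before a non-zero exit code is attributed to the script, the executor could check whether the build pod still exists and was not evicted, and report a runner system failure when it is gone.

// Illustrative sketch only: a helper the kubernetes executor could call before
// mapping a non-zero exit code to script_failure. All names are hypothetical.
package eviction

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podGone reports whether the build pod was evicted or deleted out from under
// the job. When it returns true, the caller should report runner_system_failure
// instead of attributing the container's exit code to the user's script.
func podGone(ctx context.Context, c kubernetes.Interface, namespace, name string) (bool, string) {
	pod, err := c.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Matches the "Error cleaning up pod: pods ... not found" symptom:
		// the pod vanished together with the reclaimed Spot node.
		return true, "build pod no longer exists"
	}
	if err != nil {
		// Other API errors are inconclusive; leave the decision to the caller.
		return false, ""
	}
	if pod.Status.Reason == "Evicted" || pod.Status.Phase == corev1.PodFailed {
		return true, "build pod evicted or failed: " + pod.Status.Reason
	}
	return false, ""
}

A check along these lines would likely also cover the "Cannot connect to the Docker daemon" symptom, since the docker:dind service container disappears together with the evicted pod.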