
Gitlab k8s runner status script_failure instead of runner_system_failure on GCP preemptible nodes

Status update (2024-09-06)

  • We ran tests on GKE version 1.29.7-gke.1104000. The error message we now see when we remove the node running an active job is: ERROR: Job failed (system failure):

  • If you are still seeing a script_failure error for this use case, add a comment below with your configuration details: GitLab version, Runner version, and GKE version.

Summary

I see this issue only on GCP clusters with preemptible nodes. When GCP preempts a node, the following error appears in the GitLab Runner log output:

ERROR: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending"

The job is marked as a script failure instead of a runner system failure. A similar issue, #26856 (comment 637972287), was closed, but the problem still persists on GitLab Runner 14.1. It may be related to #25463.

Steps to reproduce

Run a GKE cluster with preemptible nodes, and make sure the GitLab Runner job pods are scheduled on a preemptible node. Simply deleting the pod does not reproduce the issue, because that produces a different failure. Wait until GCP preempts the node, then check the failure reason.

.gitlab-ci.yml
stages:
  - test

Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - sleep 7200
  tags:
    - kubernetes # running the job using kubernetes runner

Actual behavior

When the GitLab job fails, it is marked as a script failure.

Expected behavior

The job should be marked as a runner system failure.
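
As a stopgap while preempted jobs are reported as script failures, the retry rule from the reproduction job could be widened to cover both failure reasons. This is only a sketch, not a fix: it assumes that re-running the job after a genuine script failure is acceptable, because `script_failure` also matches real script errors.

```yaml
Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when:
      - runner_system_failure  # the reason expected for a preempted node
      - script_failure         # the reason actually reported in this issue
  script:
    - sleep 7200
  tags:
    - kubernetes
```

With this rule a preempted job is retried either way, at the cost of also retrying jobs whose script really failed.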

Relevant logs and/or screenshots

job log
py36-commit-11a421a3: Pulling from mytest/mytest/test-int-airflow-full
Uploading artifacts for failed job
00:00
Cleaning up file based variables
00:00
ERROR: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending"
runner manager log
kubectl logs -f main-gitlab-runner-6d96f6f668-sgcp7 | grep runner-vzn22vuk-project-7965327-concurrent-24r5lkr
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying...  job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying...  job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying...  job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying...  job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: i/o timeout. Retrying...  job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error while executing file based variables removal script  error=pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" (on namespace "gitlab") is not running and cannot execute commands; current phase is "Pending" job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending"  duration_s=282.836680634 job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Failed to process runner                   builds=58 error=pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending" executor=kubernetes runner=Vzn22VuK
GKE logs
{
  insertId: "p967pre44yh6"
  logName: "projects/mytest/logs/cloudaudit.googleapis.com%2Fsystem_event"
  operation: {
    first: true
    id: "systemevent-1627478098635-5c82ec5f2eb96-a78355f7-75001ba6"
    last: true
    producer: "compute.instances.preempted"
  }
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {
      principalEmail: "system@google.com"
    }
    methodName: "compute.instances.preempted"
    request: {
      @type: "type.googleapis.com/compute.instances.preempted"
    }
    resourceName: "projects/mytest/zones/us-east1-c/instances/gke-dbnd-dev-gitlab-main-0ad83d82-snhk"
    serviceName: "compute.googleapis.com"
    status: {
      message: "Instance was preempted."
    }
  }
  receiveTimestamp: "2021-07-28T13:15:02.634243265Z"
  resource: {
    labels: {
      instance_id: "7775617557627464150"
      project_id: "mytest"
      zone: "us-east1-c"
    }
    type: "gce_instance"
  }
  severity: "INFO"
  timestamp: "2021-07-28T13:15:01.963041Z"
}

Screenshot_2021-07-28_at_19.15.11

Environment description

We use the SaaS gitlab.com server with GitLab Runner installed on our GKE cluster (1.19.10) using Helm chart v0.31. We use the Kubernetes executor.

values.yaml helm
## The GitLab Server URL (with protocol) that want to register the runner against
## ref: https://docs.gitlab.com/runner/commands/README.html#gitlab-runner-register
##
gitlabUrl: https://gitlab.com/

## The registration token for adding new Runners to the GitLab server. This must
## be retrieved from your GitLab instance.
## ref: https://docs.gitlab.com/ee/ci/runners/
##
runnerRegistrationToken: "mytoken"

concurrent: 150

nodeSelector:
  node_pool: system

rbac:
  serviceAccountName: gitlab-runner-admin

replicas: 2

resources: 
  limits:
    memory: 1Gi
    cpu: 1
  requests:
    memory: 1Gi
    cpu: 256m

runners:
  locked: false
  requestConcurrency: 10
  config: |
    [[runners]]
      output_limit = 16384
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
      pre_clone_script = "echo '172.65.251.78 gitlab.com' >> /etc/hosts"
      [runners.kubernetes]
        image = "ubuntu:18.04"
        privileged = true
        service_account = "gitlab-runner-admin"
        poll_timeout = 1200
        cpu_limit = "2"
        cpu_limit_overwrite_max_allowed = "4"
        memory_limit = "2Gi"
        memory_limit_overwrite_max_allowed = "8Gi"
        cpu_request = "700m"
        cpu_request_overwrite_max_allowed = "3"
        memory_request = "1Gi"
        memory_request_overwrite_max_allowed = "5Gi"
        service_cpu_limit = "3"
        service_cpu_limit_overwrite_max_allowed = "4"
        service_memory_limit = "3Gi"
        service_memory_limit_overwrite_max_allowed = "8Gi"
        service_cpu_request = "700m"
        service_cpu_request_overwrite_max_allowed = "3"
        service_memory_request = "2Gi"
        service_memory_request_overwrite_max_allowed = "4Gi"
        helper_cpu_limit = "2"
        helper_cpu_limit_overwrite_max_allowed = "3"
        helper_memory_limit = "256Mi"
        helper_memory_limit_overwrite_max_allowed = "2Gi"
        helper_cpu_request = "100m"
        helper_cpu_request_overwrite_max_allowed = "1"
        helper_memory_request = "128Mi"
        helper_memory_request_overwrite_max_allowed = "1Gi"
        [runners.kubernetes.node_selector]
          node_pool = "gitlab-main"
        [runners.kubernetes.pod_annotations]
          "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
      [[runners.kubernetes.dns_config.options]]
        name = "single-request-reopen"
      [[runners.kubernetes.dns_config.options]]
        name = "ndots"
        value = "2"

metrics:
  enabled: true
config.toml contents
listen_address = ":9252"
concurrent = 150
check_interval = 30
log_level = "info"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "main-gitlab-runner-6d96f6f668-sgcp7"
  output_limit = 16384
  request_concurrency = 10
  url = "https://gitlab.com/"
  token = "mytoken"
  executor = "kubernetes"
  environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
  pre_clone_script = "echo '172.65.251.78 gitlab.com' >> /etc/hosts"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "ubuntu:18.04"
    namespace = "gitlab"
    namespace_overwrite_allowed = ""
    privileged = true
    cpu_limit = "2"
    cpu_limit_overwrite_max_allowed = "4"
    cpu_request = "700m"
    cpu_request_overwrite_max_allowed = "3"
    memory_limit = "2Gi"
    memory_limit_overwrite_max_allowed = "8Gi"
    memory_request = "1Gi"
    memory_request_overwrite_max_allowed = "5Gi"
    service_cpu_limit = "3"
    service_cpu_limit_overwrite_max_allowed = "4"
    service_cpu_request = "700m"
    service_cpu_request_overwrite_max_allowed = "3"
    service_memory_limit = "3Gi"
    service_memory_limit_overwrite_max_allowed = "8Gi"
    service_memory_request = "2Gi"
    service_memory_request_overwrite_max_allowed = "4Gi"
    helper_cpu_limit = "2"
    helper_cpu_limit_overwrite_max_allowed = "3"
    helper_cpu_request = "100m"
    helper_cpu_request_overwrite_max_allowed = "1"
    helper_memory_limit = "256Mi"
    helper_memory_limit_overwrite_max_allowed = "2Gi"
    helper_memory_request = "128Mi"
    helper_memory_request_overwrite_max_allowed = "1Gi"
    poll_timeout = 1200
    service_account = "gitlab-runner-admin"
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
      node_pool = "gitlab-main"
    [runners.kubernetes.affinity]
    [runners.kubernetes.pod_annotations]
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]

      [[runners.kubernetes.dns_config.options]]
        name = "single-request-reopen"

      [[runners.kubernetes.dns_config.options]]
        name = "ndots"
        value = "2"

Used GitLab Runner version

gitlab-runner --version
Version:      14.1.0
Git revision: 8925d9a0
Git branch:   14-1-stable
GO version:   go1.13.8
Built:        2021-07-20T11:43:26+0000
OS/Arch:      linux/amd64

Possible fixes

Don't know.

Edited by Darren Eastman