Gitlab k8s runner status script_failure instead of runner_system_failure on GCP preemptible nodes
Status update (2024-09-06)
- We ran tests on GKE version 1.29.7-gke.1104000. However, the error message we currently see when we remove the node from under an active job is:
  ERROR: Job failed (system failure):
- If you are still experiencing the stated problem of a script_failure error generated for this use case, add a comment below with your configuration details: GitLab version, Runner version, GKE version.
Summary
I see this issue only on GCP, on a cluster with preemptible nodes. When GCP preempts a node, the following error appears in the GitLab Runner log output:
ERROR: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending"
The job is then marked as a script failure (script_failure) instead of a runner system failure (runner_system_failure).
A similar issue, #26856 (comment 637972287), was closed, but the problem still persists on GitLab Runner 14.1.
This may be related to #25463.
Steps to reproduce
Run a GKE cluster with preemptible nodes. The GitLab runners should run on a preemptible node. Simply deleting the pod does not reproduce the issue, because the resulting failure is different. Wait until GCP preempts the node, then check the failure reason.
.gitlab-ci.yml
stages:
  - test
Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when: runner_system_failure
  script:
    - sleep 7200
  tags:
    - kubernetes # running the job using kubernetes runner
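Until the failure is classified correctly, one possible stopgap is to also retry on script_failure, since that is the reason the job is currently reported with. This is a minimal sketch, assuming the same job as above. It relies on retry:when accepting a list of failure types. The drawback is that genuine script failures are retried too, so it works around the misclassification rather than fixing it.

```yaml
Test:
  stage: test
  image: any_image
  retry:
    max: 2
    when:
      - runner_system_failure  # the reason expected for a preempted node
      - script_failure         # the reason actually reported (assumed workaround)
  script:
    - sleep 7200
  tags:
    - kubernetes
```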
Actual behavior
When the GitLab job fails, it is marked as a script failure.
Expected behavior
The job should be marked as a runner system failure.
Relevant logs and/or screenshots
job log
py36-commit-11a421a3: Pulling from mytest/mytest/test-int-airflow-full
Uploading artifacts for failed job
00:00
Cleaning up file based variables
00:00
ERROR: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending"
runner manager log
k logs -f main-gitlab-runner-6d96f6f668-sgcp7 | grep runner-vzn22vuk-project-7965327-concurrent-24r5lkr
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying... job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying... job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying... job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: connect: connection refused. Retrying... job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error streaming logs gitlab/runner-vzn22vuk-project-7965327-concurrent-24r5lkr/helper:/logs-7965327-1458990743/output.log: error dialing backend: dial tcp 10.54.32.93:10250: i/o timeout. Retrying... job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Error while executing file based variables removal script error=pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" (on namespace "gitlab") is not running and cannot execute commands; current phase is "Pending" job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Job failed: pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending" duration_s=282.836680634 job=1458990743 project=7965327 runner=Vzn22VuK
WARNING: Failed to process runner builds=58 error=pod "runner-vzn22vuk-project-7965327-concurrent-24r5lkr" status is "Pending" executor=kubernetes runner=Vzn22VuK
GKE logs
{
  insertId: "p967pre44yh6"
  logName: "projects/mytest/logs/cloudaudit.googleapis.com%2Fsystem_event"
  operation: {
    first: true
    id: "systemevent-1627478098635-5c82ec5f2eb96-a78355f7-75001ba6"
    last: true
    producer: "compute.instances.preempted"
  }
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {
      principalEmail: "system@google.com"
    }
    methodName: "compute.instances.preempted"
    request: {
      @type: "type.googleapis.com/compute.instances.preempted"
    }
    resourceName: "projects/mytest/zones/us-east1-c/instances/gke-dbnd-dev-gitlab-main-0ad83d82-snhk"
    serviceName: "compute.googleapis.com"
    status: {
      message: "Instance was preempted."
    }
  }
  receiveTimestamp: "2021-07-28T13:15:02.634243265Z"
  resource: {
    labels: {
      instance_id: "7775617557627464150"
      project_id: "mytest"
      zone: "us-east1-c"
    }
    type: "gce_instance"
  }
  severity: "INFO"
  timestamp: "2021-07-28T13:15:01.963041Z"
}
Environment description
We use the SaaS gitlab.com server, with GitLab Runner installed on our GKE cluster (1.19.10) using Helm chart v0.31. We are using the Kubernetes executor.
values.yaml helm
## The GitLab Server URL (with protocol) that want to register the runner against
## ref: https://docs.gitlab.com/runner/commands/README.html#gitlab-runner-register
##
gitlabUrl: https://gitlab.com/
## The registration token for adding new Runners to the GitLab server. This must
## be retrieved from your GitLab instance.
## ref: https://docs.gitlab.com/ee/ci/runners/
##
runnerRegistrationToken: "mytoken"
concurrent: 150
nodeSelector:
  node_pool: system
rbac:
  serviceAccountName: gitlab-runner-admin
replicas: 2
resources:
  limits:
    memory: 1Gi
    cpu: 1
  requests:
    memory: 1Gi
    cpu: 256m
runners:
  locked: false
  requestConcurrency: 10
  config: |
    [[runners]]
      output_limit = 16384
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
      pre_clone_script = "echo '172.65.251.78 gitlab.com' >> /etc/hosts"
      [runners.kubernetes]
        image = "ubuntu:18.04"
        privileged = true
        service_account = "gitlab-runner-admin"
        poll_timeout = 1200
        cpu_limit = "2"
        cpu_limit_overwrite_max_allowed = "4"
        memory_limit = "2Gi"
        memory_limit_overwrite_max_allowed = "8Gi"
        cpu_request = "700m"
        cpu_request_overwrite_max_allowed = "3"
        memory_request = "1Gi"
        memory_request_overwrite_max_allowed = "5Gi"
        service_cpu_limit = "3"
        service_cpu_limit_overwrite_max_allowed = "4"
        service_memory_limit = "3Gi"
        service_memory_limit_overwrite_max_allowed = "8Gi"
        service_cpu_request = "700m"
        service_cpu_request_overwrite_max_allowed = "3"
        service_memory_request = "2Gi"
        service_memory_request_overwrite_max_allowed = "4Gi"
        helper_cpu_limit = "2"
        helper_cpu_limit_overwrite_max_allowed = "3"
        helper_memory_limit = "256Mi"
        helper_memory_limit_overwrite_max_allowed = "2Gi"
        helper_cpu_request = "100m"
        helper_cpu_request_overwrite_max_allowed = "1"
        helper_memory_request = "128Mi"
        helper_memory_request_overwrite_max_allowed = "1Gi"
        [runners.kubernetes.node_selector]
          node_pool = "gitlab-main"
        [runners.kubernetes.pod_annotations]
          "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
        [[runners.kubernetes.dns_config.options]]
          name = "single-request-reopen"
        [[runners.kubernetes.dns_config.options]]
          name = "ndots"
          value = "2"
metrics:
  enabled: true
config.toml contents
listen_address = ":9252"
concurrent = 150
check_interval = 30
log_level = "info"
[session_server]
session_timeout = 1800
[[runners]]
name = "main-gitlab-runner-6d96f6f668-sgcp7"
output_limit = 16384
request_concurrency = 10
url = "https://gitlab.com/"
token = "mytoken"
executor = "kubernetes"
environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
pre_clone_script = "echo '172.65.251.78 gitlab.com' >> /etc/hosts"
[runners.custom_build_dir]
[runners.cache]
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = "ubuntu:18.04"
namespace = "gitlab"
namespace_overwrite_allowed = ""
privileged = true
cpu_limit = "2"
cpu_limit_overwrite_max_allowed = "4"
cpu_request = "700m"
cpu_request_overwrite_max_allowed = "3"
memory_limit = "2Gi"
memory_limit_overwrite_max_allowed = "8Gi"
memory_request = "1Gi"
memory_request_overwrite_max_allowed = "5Gi"
service_cpu_limit = "3"
service_cpu_limit_overwrite_max_allowed = "4"
service_cpu_request = "700m"
service_cpu_request_overwrite_max_allowed = "3"
service_memory_limit = "3Gi"
service_memory_limit_overwrite_max_allowed = "8Gi"
service_memory_request = "2Gi"
service_memory_request_overwrite_max_allowed = "4Gi"
helper_cpu_limit = "2"
helper_cpu_limit_overwrite_max_allowed = "3"
helper_cpu_request = "100m"
helper_cpu_request_overwrite_max_allowed = "1"
helper_memory_limit = "256Mi"
helper_memory_limit_overwrite_max_allowed = "2Gi"
helper_memory_request = "128Mi"
helper_memory_request_overwrite_max_allowed = "1Gi"
poll_timeout = 1200
service_account = "gitlab-runner-admin"
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
[runners.kubernetes.node_selector]
node_pool = "gitlab-main"
[runners.kubernetes.affinity]
[runners.kubernetes.pod_annotations]
"cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
[runners.kubernetes.pod_security_context]
[runners.kubernetes.volumes]
[runners.kubernetes.dns_config]
[[runners.kubernetes.dns_config.options]]
name = "single-request-reopen"
[[runners.kubernetes.dns_config.options]]
name = "ndots"
value = "2"
Used GitLab Runner version
gitlab-runner --version
Version: 14.1.0
Git revision: 8925d9a0
Git branch: 14-1-stable
GO version: go1.13.8
Built: 2021-07-20T11:43:26+0000
OS/Arch: linux/amd64
Possible fixes
Don't know.
