Categorize failure on pod status as system failure
What does this MR do?
This MR changed the FailureReason for failed status update for the RunWithAttach
Before the change:.
When listening to updates of the status of the pod in the Kubernetes executor:
https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/executors/kubernetes/kubernetes.go#L666.
podStatusCh := s.watchPodStatus(ctx)
.
If the status of pod returns an error that isn't a pod_not_found
it will:
return &common.BuildError{Inner: err}
(default FailureReason
being ScriptFailure
)
https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/executors/kubernetes/kubernetes.go#L677-682.
The output of the job in case of cluster error would therefore be:
ERROR: Job failed: prepare environment: pod "runner-eopgp1ua-project-4670-concurrent-46299b" status is "Failed". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Can be caused by failing admission:
This MR categorizes those errors as RunnerSystemFailures instead of ScriptFailures.
After the change
ERROR: Job failed (system failure): prepare environment: pod "runner-bksmbay-project-117-concurrent-1sk9fb" status is "Failed". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Why was this MR needed?
This MR was needed on our side because we are using retry on RunnerSystemFailures
and because the error above (which is related to the infra) is categorized as ScriptFailure, it was making pipelines to hard fail on those few cases when we just wanted a retry.
What are the relevant issue numbers?
Couldn't find a related issue.
This is not just a Merge Request but more of a question because I might not see the full picture.
I'm guessing there is a reason behind that, otherwise the code would just be return err
in all cases.
case err := <-podStatusCh:
if IsKubernetesPodNotFoundError(err) {
return err
}
return &common.BuildError{Inner: err}
Would just be
case err := <-podStatusCh:
return err
Is there a reason that would justify to let them as ScriptFailures
and to only have pod_not_found
as system failure ?