Skip to content

Categorize failure on pod status as system failure

What does this MR do?

This MR changed the FailureReason for failed status update for the RunWithAttach

Before the change:.
When listening to updates of the status of the pod in the Kubernetes executor:
https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/executors/kubernetes/kubernetes.go#L666.
podStatusCh := s.watchPodStatus(ctx).

If the status of pod returns an error that isn't a pod_not_found it will:
return &common.BuildError{Inner: err} (default FailureReason being ScriptFailure) https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/executors/kubernetes/kubernetes.go#L677-682.

The output of the job in case of cluster error would therefore be:

ERROR: Job failed: prepare environment: pod "runner-eopgp1ua-project-4670-concurrent-46299b" status is "Failed". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Can be caused by failing admission:

This MR categorizes those errors as RunnerSystemFailures instead of ScriptFailures.

After the change

ERROR: Job failed (system failure): prepare environment: pod "runner-bksmbay-project-117-concurrent-1sk9fb" status is "Failed". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Why was this MR needed?

This MR was needed on our side because we are using retry on RunnerSystemFailures and because the error above (which is related to the infra) is categorized as ScriptFailure, it was making pipelines to hard fail on those few cases when we just wanted a retry.

What are the relevant issue numbers?

Couldn't find a related issue.


This is not just a Merge Request but more of a question because I might not see the full picture. I'm guessing there is a reason behind that, otherwise the code would just be return err in all cases.

 case err := <-podStatusCh: 
 	if IsKubernetesPodNotFoundError(err) { 
 		return err 
 	} 
  
 	return &common.BuildError{Inner: err} 

Would just be

 case err := <-podStatusCh: 
 	return err 

Is there a reason that would justify to let them as ScriptFailures and to only have pod_not_found as system failure ? 🤔

Edited by Baptiste Lalanne

Merge request reports