Kubernetes executor: redial backend on internal server errors (!3732) · Merge requests · GitLab.org / gitlab-runner

Arran Walker requested to merge ajwalker/k8s-backend-dial-error into main Nov 16, 2022

What does this MR do?

When connecting to a Pod, we already retry on failure when the returned error is an internal server error with the message "error dialing backend: EOF".

This MR expands the retry logic to cover any error of the "internal server error" type whose message begins with "error dialing backend".

Why was this MR needed?

This change catches other types of temporary backend failures, including the problematic "error dialing backend: remote error: tls: internal error", which occurs when a Pod has been scheduled to a Node but the Node's certificate isn't yet ready to accept connections.

The retry count has been increased from 5 to 30 to accommodate the delays seen by such a failure.

What's the best way to test this MR?

This has been tested by a customer that was experiencing this problem: !3556 (comment 1174648462)

There were existing tests for dialing backend errors; we've just expanded their scope.

What are the relevant issue numbers?

#27901 (closed)

Edited Nov 17, 2022 by Arran Walker
