
Kubernetes executor: redial backend on internal server errors

Arran Walker requested to merge ajwalker/k8s-backend-dial-error into main

What does this MR do?

When connecting to a Pod, we already retry on failure when the error returned is an internal server error with the message "error dialing backend: EOF".

This MR expands the retry logic to cover any error of the "internal server error" type whose message begins with "error dialing backend".

Why was this MR needed?

This change catches other types of temporary backend failures, including the problematic "error dialing backend: remote error: tls: internal error", which occurs when a Pod has been scheduled to a Node but the Node's certificate isn't yet ready to accept connections.

The retry count has been increased from 5 to 30 to accommodate the delays seen with such failures.

What's the best way to test this MR?

This has been tested by a customer that was experiencing this problem: !3556 (comment 1174648462)

There were existing tests for dialing backend errors; we've just expanded their scope.

What are the relevant issue numbers?

#27901 (closed)

Edited by Arran Walker
