
Make Kubernetes API retries configurable

What does this MR do?

It makes the retry limit for Kubernetes API calls configurable, instead of relying on a limit hardcoded in a constant.

Why was this MR needed?

We run our jobs in EKS Kubernetes and use Karpenter to scale nodes.

With the default limit, our jobs often fail with the error: prepare environment: setting up trapping scripts on emptyDir: error dialing backend: remote error: tls: internal error

When we bumped defaultTries to 35, the issue disappeared. We therefore made this limit configurable, defaulting to the old value so that a higher limit is not enforced for everyone.

We have been running this code in production since 6.12.2023 and have successfully processed more than 6000 jobs since then.

What's the best way to test this MR?

What are the relevant issue numbers?
