Wait for k8s pod to become attachable as part of poll period

What does this MR do?

Waits for the k8s pod to become attachable after it has started running and before proceeding with the rest of the executor flow.

Why was this MR needed?

Without it, newly joined k8s nodes are marked as ready and have pods scheduled on them before their certificate requests are completed, which causes attaching and streaming logs to fail. For k8s clusters using the cluster autoscaler or similar, this makes the runner extremely flaky (> 50% of 50 jobs consistently fail when scaling from 0 workers).

Upstream k8s does not consider a completed certificate request to be a requirement for a node becoming ready. As such, the assumption that the GitLab runner currently makes (that pods will be attachable if running) is not guaranteed by k8s, and in the case of autoscaled clusters it is not true for 0-60s after a node is marked ready.

Impact

When the node on which a job is scheduled is already in a state usable by the GitLab runner, one additional API call is made at the start of the job to validate that.

When a node is not in a usable state, an additional API call is made at the already configured polling interval until the node reaches a usable state.

What's the best way to test this MR?

  • set up a k8s cluster with an autoscaler
  • create a matrix job with 50 variations
  • schedule the job with a worker pool of zero
  • watch the failures roll in

What are the relevant issue numbers?
