
Wait for k8s pod to become attachable as part of poll period

What does this MR do?

Waits for the k8s pod to become attachable after it has started running, before proceeding with the rest of the executor flow.

Why was this MR needed?

Without it, newly joined k8s nodes are marked as ready and have pods scheduled on them before their certificate signing requests complete, which causes attaching and streaming logs to fail. For k8s clusters using the cluster autoscaler or similar, this makes the runner extremely flaky (more than 50% of 50 jobs consistently fail when scaling from 0 workers).

Upstream k8s does not consider waiting for the certificate to be a requirement for a node becoming ready. As such, the assumption the GitLab runner currently makes (that pods are attachable once running) is not guaranteed by k8s, and in the case of autoscaled clusters does not hold for 0-60s after a node is marked ready.

Impact

When the node on which a job is scheduled is already in a state usable by the GitLab runner, one additional API call is made at the start of the job to validate that.

When a node is not in a usable state, an additional API call is made at the already configured polling interval until the node reaches a usable state.

What's the best way to test this MR?

  • set up a k8s cluster with the cluster autoscaler
  • create a matrix job with 50 variations
  • schedule the job with a worker pool of zero
  • watch the failures roll in

What are the relevant issue numbers?

Edited by Darren Eastman

	errc := make(chan error)
	go func() {
		defer close(errc)
		errc <- getPodLog(c, pod)
	}()
	return errc
}

func waitForPodAttach(
	ctx context.Context,
	c *kubernetes.Clientset,
	pod *api.Pod,
	config *common.KubernetesConfig,
) error {
	pollInterval := config.GetPollInterval()
	pollAttempts := config.GetPollAttempts()
  • This isn't perfect since it will stack on the time allowed for waitForPodRunning(). I mostly wanted a clean fix and figured the maintainers might have thoughts on how to set this up.

  • Ezekiel Kigbo requested review from @akohlbecker

  • Ezekiel Kigbo removed review request for @akohlbecker

  • Ezekiel Kigbo requested review from @akohlbecker

  • Ezekiel Kigbo removed review request for @ekigbo

    • Link to request: Internal ZD ticket
    • Priority: customer priority10
    • Why interested: Customer is trying to use GitLab Runner (Kubernetes executor) with the cluster autoscaler for AWS
    • Problem they are trying to solve: Job is failing with ERROR: Job failed (system failure): prepare environment: error dialing backend: remote error: tls: internal error. The CSR is pending while the Node is marked as ready, which causes this error message.
    • Current solution for this problem: No acceptable solution or workaround (using 'retry') has been found.
    • Impact to the customer of not having this: Currently the customer cannot use autoscaling.
    • Questions: As part of the epic Autoscaling Provider for GitLab Runner, could you please help find a reviewer for this MR?
    • PM to mention: @DarrenEastman
    Edited by Segolene Bouly
  • Arran Walker requested review from @ajwalker and removed review request for @akohlbecker