Wait for k8s pod to become attachable as part of poll period
What does this MR do?
Waits for the k8s pod to become attachable after it has started running and before proceeding with the rest of the executor flow.
Why was this MR needed?
Without it, newly joined k8s nodes are marked as ready and have pods scheduled on them before their certificate requests are completed, which causes attaching to pods and streaming logs to fail. For k8s clusters using the cluster autoscaler or similar, this makes the runner extremely flaky: more than 50% of 50 jobs consistently fail when scaling from 0 workers.
Upstream k8s does not consider waiting for the certificate to be a requirement for a node becoming ready. As such, the assumption that GitLab Runner currently makes (that pods will be attachable if running) is not guaranteed by k8s and, in the case of autoscaled clusters, does not hold for 0-60s after a node is marked ready.
Impact
When the node on which a job is scheduled is already in a state usable by GitLab Runner, one additional API call is made at the start of the job to validate that.
When a node is not in a usable state, an additional API call is made at the already configured polling interval until the node is in a usable state.
What's the best way to test this MR?
- set up a k8s cluster with the cluster autoscaler
- create a matrix job with 50 variations
- schedule the job with a worker pool of zero
- watch the failures roll in
What are the relevant issue numbers?
Activity
Hey @jimmy-outschool!
Thank you for your contribution to GitLab. Please refer to the contribution flow documentation for a quick overview of the process, and the merge request (MR) guidelines for the detailed process.
When you're ready for a first review, post @gitlab-bot ready. If you know a relevant reviewer(s) (for example, someone that was involved in a related issue), you can also assign them directly with @gitlab-bot ready @user1 @user2.
At any time, if you need help moving the MR forward, feel free to post @gitlab-bot help. Read more on how to get help.
Some contributions require several iterations of review and we try to mentor contributors during this process. However, we understand that some reviews can be very time consuming. If you would prefer for us to continue the work you've submitted now or at any point in the future please let us know.
If you're okay with being part of our review process (and we hope you are!), there are several initial checks we ask you to make:
- The merge request description clearly explains:
  - The problem being solved.
  - The best way a reviewer can test your changes (is it possible to provide an example?).
- If the pipeline failed, do you need help identifying what failed?
- Check that Go code follows our Go guidelines.
- Read our contributing to GitLab Runner document.
This message was generated automatically. You're welcome to improve it.
added Community contribution and workflow::in dev labels
assigned to @jimmy-outschool
mentioned in issue #27901 (closed)
@gitlab-bot ready
Hi @jimmy-outschool, thanks for picking this issue up and opening an MR. Unfortunately I'm not able to help with a review here, but I've assigned someone who might be able to help.
In the meantime, a quick glance at the pipelines looks like there might be some failing tests.
added workflow::ready for review label and removed workflow::in dev label
@ekigbo, this Community contribution is ready for review.
- Do you have capacity and domain expertise to review this? We are mindful of your time, so if you are not able to take this on, please re-assign to one or more other reviewers.
- Add the workflow::in dev label if the merge request needs action from the author.
This message was generated automatically. You're welcome to improve it.
Hi @akohlbecker are you able to help review this MR?
requested review from @ekigbo
Customer record: https://gitlab.my.salesforce.com/0014M00001sGS1F
added customer label
	errc := make(chan error)
	go func() {
		defer close(errc)
		errc <- getPodLog(c, pod)
	}()
	return errc
}

func waitForPodAttach(
	ctx context.Context,
	c *kubernetes.Clientset,
	pod *api.Pod,
	config *common.KubernetesConfig,
) error {
	pollInterval := config.GetPollInterval()
	pollAttempts := config.GetPollAttempts()
added 1st contribution label
requested review from @akohlbecker
removed review request for @akohlbecker
requested review from @akohlbecker
removed review request for @ekigbo
removed needs investigation label
- Link to request: Internal ZD ticket
- Priority: customer priority::10
- Why interested: Customer is trying to use GitLab Runner (Kubernetes executor) with the cluster autoscaler for AWS
- Problem they are trying to solve: Job is failing with "ERROR: Job failed (system failure): prepare environment: error dialing backend: remote error: tls: internal error." The CSR is pending while the node is already marked as ready, which causes this error.
- Current solution for this problem: No acceptable solution or workaround (using 'retry') has been found.
- Impact to the customer of not having this: Currently the customer cannot use autoscaling.
- Questions: As part of the epic Autoscaling Provider for GitLab Runner, could you please help to find a reviewer for this MR?
- PM to mention: @DarrenEastman
Edited by Segolene Bouly
requested review from @ajwalker and removed review request for @akohlbecker