AutoDeploy is unstable in Cells (#20623) · Issues · GitLab.com / GitLab Infrastructure Team / delivery · GitLab

AutoDeploy is unstable in Cells

### Problem Description

Auto-Deploys in our Cells infrastructure are commonly failing. The reason appears to be tied to a time limit in the scripting that performs the rollout.

Use this issue to investigate why we are failing. Spin up an issue to determine how we can expose these failures, as they are currently slipping under the radar.

### Example

Using https://ops.gitlab.net/gitlab-com/gl-infra/cells/tissue/-/jobs/15777514 as an example, we can see we were waiting for Pods to rotate within a 15 minute window. The point at which we fail varies between jobs and Cells. I don't think the script is the problem, as occasionally we'll see a successful deploy, or a retry will work just fine. We may need to interrogate the Pods as they are cycling through to determine what is happening and why we are hitting the time limit of the script as defined. Note that it is not the Job timeout that is failing, but the waiter of the `kubectl` call.

### Investigation Details

* Implementation Instructions: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20623#note_2230644710
* Instructions for acquiring a flamegraph: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20623#note_2230604998

### Conclusive Results

We've effectively narrowed this down to disk IO issues: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20623#note_2242173616

We've since upgraded our Reference Architectures to leverage SSDs, and later updated our Cells.

### Exit Criterion

* [x] Investigate
* [x] Fix
* [x] Validate
* [x] Issues to address observability are raised - https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20623#note_2216258902
* [x] Issues to address gaps in troubleshooting are raised - https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20623#note_2216258902
* [ ] Documentation in `runbooks` updated or additions created to account for the progress and troubleshooting
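
### Sketch: a waiter that records what it saw

The Example section distinguishes the Job timeout from the waiter's own time limit, and suggests interrogating the Pods while they cycle. The following is a minimal sketch of that idea, not the actual deploy script, which this issue does not show. `wait_for_rollout`, `WaitResult`, and the `get_ready` callback are hypothetical names; `get_ready` stands in for whatever `kubectl` query reports the number of ready Pods. The 900 second default mirrors the 15 minute window mentioned above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WaitResult:
    ok: bool                # True if the rollout finished inside the window
    elapsed: float          # seconds spent waiting
    # One (elapsed_seconds, ready_pods, desired_pods) tuple per poll.
    history: List[Tuple[float, int, int]] = field(default_factory=list)


def wait_for_rollout(
    get_ready: Callable[[], int],   # hypothetical: returns the current ready Pod count
    desired: int,
    timeout_s: float = 900.0,       # assumption: the 15 minute waiter window
    interval_s: float = 10.0,       # assumption: poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WaitResult:
    """Poll until `desired` Pods are ready or the waiter's own deadline passes.

    Every poll is appended to the history, so a timeout still leaves a record
    of how far the rollout got and how quickly it was progressing.
    """
    start = clock()
    history: List[Tuple[float, int, int]] = []
    while True:
        elapsed = clock() - start
        ready = get_ready()
        history.append((elapsed, ready, desired))
        if ready >= desired:
            return WaitResult(True, elapsed, history)
        if elapsed >= timeout_s:
            return WaitResult(False, elapsed, history)
        sleep(interval_s)
```

On a timeout, the returned `history` shows whether the rollout stalled completely or was progressing slowly, which is the kind of signal that could help separate a script problem from slow Pods. The `clock` and `sleep` parameters are injectable only so the waiter can be exercised without real delays.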
