Improve our diagnosis toolkit for review app issues
Context
We had a review app incident recently, and I am frustrated that such problems are hard to diagnose with our current set of processes and tools.
Goals
- Anybody reading our RUNBOOKs should feel empowered to diagnose the root cause of a review app incident.
- Anybody reading our RUNBOOKs should be able to diagnose the root cause of most review app incidents in less than 30 minutes (this number is arbitrary).
Ideas
- Reach out to another team (SREs, Distribution, ...) for guidance on how to effectively diagnose issues in our current GKE setup (ask somebody for a meeting or a past recording)
- Add tooling (or create some "views" in the GCP console or another existing tool) to see which GitLab component was not responding (i.e. returning 400s/500s) or where the pods failed for a given review app.
- Consider a better logging strategy: Maybe we need something better than GKE logs.
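
As a starting point for the tooling idea above, here is a minimal sketch of a helper that lists the pods that are not healthy for one review app. It assumes the pod list comes from `kubectl get pods -n <namespace> -o json` and that each review app can be isolated by namespace; both the function name `summarize_unhealthy_pods` and that namespace assumption are hypothetical and not part of our current setup.

```python
from typing import Any


def summarize_unhealthy_pods(pod_list: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one entry per pod that is not fully ready, with a reason if known.

    `pod_list` is assumed to be the parsed JSON of `kubectl get pods -o json`.
    """
    unhealthy = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")

        # Pods from completed jobs are not failures, so skip them.
        if phase == "Succeeded":
            continue

        problems = []
        for cs in status.get("containerStatuses", []):
            if cs.get("ready"):
                continue
            state = cs.get("state", {})
            # A container state holds one of: waiting, running, terminated.
            detail = state.get("waiting") or state.get("terminated") or {}
            problems.append({
                "container": cs.get("name"),
                "reason": detail.get("reason", "NotReady"),
                "restarts": cs.get("restartCount", 0),
            })

        # Report pods with unready containers, or pods stuck outside Running
        # (for example Pending or Failed) even without container statuses.
        if problems or phase != "Running":
            unhealthy.append({"pod": name, "phase": phase, "problems": problems})
    return unhealthy
```

The function could be fed with `json.load(sys.stdin)` when piping the `kubectl` output into a small script. Reasons such as `CrashLoopBackOff` or `ImagePullBackOff` would then point at the failing component without opening the GCP console.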
Edited by David Dieulivol