Improve our diagnosis toolkit for review app issues

Context

This is an incident we had recently, and I am frustrated that problems like this are hard to diagnose with our current set of processes and tools.

Goals

Anybody reading our RUNBOOKs should feel empowered to diagnose the root cause of a review app incident.
Anybody reading our RUNBOOKs should be able to diagnose the root cause of most review app incidents in less than 30 minutes (this number is arbitrary).

Ideas

Reach out to another team (SREs, Distribution, ...) for guidance on how to effectively diagnose issues in our current GKE setup (ask somebody for a meeting or a past recording)
Add tooling (or create some "views" in the GCP console or another existing tool) to see which GitLab component was failing (e.g. returning 400s/500s) or where the pods failed for a given review app...
Consider a better logging strategy: maybe we need something better than GKE logs.
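
As a starting point for the tooling idea, here is a minimal sketch of a script that lists the unhealthy pods of one review app. It assumes each review app runs in its own Kubernetes namespace and that the pod list comes from `kubectl get pods -o json`; the function name `failing_pods` and the file name `failing_pods.py` are hypothetical, not existing tooling.

```python
import json
import sys


def failing_pods(pod_list):
    """Return (pod name, reason) pairs for pods that look unhealthy.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    """
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed jobs are not failures

        # Report every container that is not ready, with its waiting reason.
        found = False
        for cs in status.get("containerStatuses", []):
            if cs.get("ready"):
                continue
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason", "not ready")
            restarts = cs.get("restartCount", 0)
            problems.append((name, f"{cs['name']}: {reason} (restarts={restarts})"))
            found = True

        # No container detail available: fall back to the pod phase.
        if not found and phase != "Running":
            problems.append((name, f"phase={phase}"))
    return problems


if __name__ == "__main__":
    # Reads the pod list from stdin and prints one line per problem.
    for pod_name, reason in failing_pods(json.load(sys.stdin)):
        print(f"{pod_name}\t{reason}")
```

It could be run as `kubectl get pods --namespace <review-app-namespace> -o json | python failing_pods.py`, where the namespace is whatever the review app was deployed to. This only covers the "where did the pods fail" part; finding which component returned 400s/500s would still need the logs.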

Edited May 30, 2022 by David Dieulivol