Create AI troubleshooting/incident runbook
Extracted from the recent incident discussion: gitlab-com/gl-infra/production#18064 (comment 1932891015)
We want to introduce a runbook for troubleshooting AI-related errors.
They may fall into many categories.
Ideally, we want to cover the most common symptoms and the algorithm for diagnosing the root cause.
Example: for authentication/authorization errors, we want to understand whether they are due to changes in our:
- Extensions?
- Language server?
- Monolith?
- AI gateway?
- Cloud Connector configuration?
The resulting runbook should link to the Cloud Connector runbook - gitlab-org/cloud-connector-team/team-tasks#177 (closed) (when it's ready).
There is a lot to unpack here, but we can iterate, starting with a simple page:
- links to logs
- how to understand the blast radius (who is affected: all users, only SM (which versions?), or only SaaS)
- how to determine when the last changes were made to the respective components
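As a starting point for the last two items, here is a minimal Python sketch of the triage logic. It is only an illustration: `ErrorSample`, `blast_radius`, `suspects`, the `"saas"`/`"self-managed"` labels, and the 24-hour window are all hypothetical, and in practice the inputs would have to come from the logs and deployment history linked on the page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ErrorSample:
    """One error occurrence taken from the logs (hypothetical shape)."""
    deployment: str  # assumed to be either "saas" or "self-managed"
    version: str     # GitLab version reported by the affected instance


def blast_radius(samples):
    """Classify who is affected: all, only SM (with versions), or only SaaS."""
    deployments = {s.deployment for s in samples}
    if not deployments:
        return "none"
    if deployments == {"saas"}:
        return "only SaaS"
    if deployments == {"self-managed"}:
        # Plain string sort; good enough for display purposes.
        versions = sorted({s.version for s in samples})
        return "only SM (" + ", ".join(versions) + ")"
    return "all"


def suspects(last_changes, incident_start, window=timedelta(hours=24)):
    """Return components changed within `window` before the incident,
    most recently changed first."""
    recent = [
        (name, changed_at)
        for name, changed_at in last_changes.items()
        if incident_start - window <= changed_at <= incident_start
    ]
    recent.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in recent]


if __name__ == "__main__":
    start = datetime(2024, 6, 1, 12, 0)
    # Timestamps below are made-up illustration data, not real deploys.
    changes = {
        "extensions": start - timedelta(days=3),
        "language_server": start - timedelta(days=2),
        "monolith": start - timedelta(hours=5),
        "ai_gateway": start - timedelta(hours=1),
        "cloud_connector": start + timedelta(hours=2),
    }
    print(suspects(changes, start))
    print(blast_radius([ErrorSample("self-managed", "16.9")]))
```

Ranking by recency only narrows down where to look first; a component changed just before the incident is a suspect, not a confirmed root cause.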
Edited by Aleksei Lipniagov