Create AI troubleshooting/incident runbook

Extracted from the recent incident discussion: gitlab-com/gl-infra/production#18064 (comment 1932891015)

We want to introduce a runbook for troubleshooting AI-related errors.
These errors can fall into many categories.
Ideally, the runbook covers the most common symptoms and a procedure for diagnosing the root cause of each.

Example: for authentication/authorization errors, we want to understand whether they are due to changes in our:

Extensions?
Language server?
Monolith?
AI gateway?
Cloud Connector configuration?
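
One way the runbook could narrow this down is to compare the incident start time with the most recent change to each component. The following is a minimal sketch, assuming the change timestamps have already been collected (by hand or from deployment tooling). The function name `suspect_components`, the 24-hour window, and the timestamps in the usage example are hypothetical and only illustrate the idea:

```python
from datetime import datetime, timedelta


def suspect_components(last_changes, incident_start, window=timedelta(hours=24)):
    """Return components changed within `window` before the incident, most recent first.

    `last_changes` maps a component name to the time of its latest change.
    Changes made after the incident started are ignored.
    """
    suspects = [
        (name, changed_at)
        for name, changed_at in last_changes.items()
        if timedelta(0) <= incident_start - changed_at <= window
    ]
    # The most recently changed component is the first one to look at.
    suspects.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in suspects]


if __name__ == "__main__":
    # Illustrative timestamps only, not taken from a real incident.
    incident_start = datetime(2024, 6, 10, 12, 0)
    last_changes = {
        "Extensions": datetime(2024, 6, 1, 0, 0),
        "Language server": datetime(2024, 6, 10, 13, 0),
        "Monolith": datetime(2024, 6, 10, 9, 0),
        "AI gateway": datetime(2024, 6, 10, 11, 0),
        "Cloud Connector configuration": datetime(2024, 5, 20, 0, 0),
    }
    print(suspect_components(last_changes, incident_start))
```

A recent change is only a hint, not proof of the root cause, so the ordering is a suggestion for where to start looking.
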

The resulting runbook should link to the Cloud Connector runbook (gitlab-org/cloud-connector-team/team-tasks#177) once it is ready.

There is a lot to unpack here, but we can iterate, starting with a simple page that covers:

links to logs
how to understand the blast radius (who is affected: all users, only SM (which versions?), or only SaaS)
how to determine when the last changes were made to the respective components
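
The blast-radius question could be answered with a small summary over error events pulled from the logs. The following is a minimal sketch, assuming each event has already been reduced to a dictionary with a `deployment` field (`"saas"` or `"self-managed"`) and, for self-managed instances, a `version` field. These field names, their values, and the function `blast_radius` are hypothetical and do not reflect an actual log schema:

```python
from collections import Counter


def blast_radius(error_events):
    """Summarise who is affected: all, only-SM, only-SaaS, or unknown.

    Also lists the self-managed versions seen among the error events.
    """
    deployments = Counter(event.get("deployment") for event in error_events)
    sm_versions = sorted(
        {
            event["version"]
            for event in error_events
            if event.get("deployment") == "self-managed" and event.get("version")
        }
    )  # plain string sort, good enough for display

    has_saas = deployments.get("saas", 0) > 0
    has_sm = deployments.get("self-managed", 0) > 0
    if has_saas and has_sm:
        scope = "all"
    elif has_sm:
        scope = "only-SM"
    elif has_saas:
        scope = "only-SaaS"
    else:
        scope = "unknown"
    return {"scope": scope, "sm_versions": sm_versions}


if __name__ == "__main__":
    # Illustrative events only.
    events = [
        {"deployment": "self-managed", "version": "17.0"},
        {"deployment": "self-managed", "version": "16.11"},
    ]
    print(blast_radius(events))
```

If errors appear only for specific self-managed versions, that points toward a version-dependent component rather than a change on the SaaS side.
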

Edited Jun 12, 2024 by Aleksei Lipniagov