Create AI troubleshooting/incident runbook

Extracted from the recent incident discussion: gitlab-com/gl-infra/production#18064 (comment 1932891015)

We want to introduce a runbook for troubleshooting AI-related errors.
These errors can fall into many categories.
Ideally, the runbook covers the most common symptoms and a step-by-step procedure for diagnosing the root cause of each.

Example: For authentication/authorization errors, we want to determine whether they are caused by changes in our:

  • Extensions?
  • Language server?
  • Monolith?
  • AI gateway?
  • Cloud Connector configuration?

Once it is ready, the resulting runbook should link to the Cloud Connector runbook: gitlab-org/cloud-connector-team/team-tasks#177 (closed).

There is a lot to unpack here, but we can iterate, starting with a simple page that covers:

  • links to logs
  • how to understand the blast radius (who is affected: everyone, only self-managed (SM; which versions?), or only SaaS)
  • how to determine when the last changes were made to the respective components
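
To make the blast-radius question concrete, here is a minimal sketch of how the runbook could classify a batch of errors. It assumes each error event carries a deployment type and an instance version; the `ErrorEvent` type, its field names, and the `"saas"` / `"self-managed"` values are hypothetical and do not reflect the actual log schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One AI-related error. All field names here are hypothetical."""
    deployment: str  # assumed values: "saas" or "self-managed"
    version: str     # GitLab version reported by the affected instance


def blast_radius(events):
    """Classify who is affected: all, only SaaS, or only self-managed."""
    deployments = {e.deployment for e in events}
    if not deployments:
        return "none"
    if deployments == {"saas"}:
        return "only-SaaS"
    if deployments == {"self-managed"}:
        # Plain string sort, used only to keep the output stable.
        versions = sorted({e.version for e in events})
        return "only-SM (" + ", ".join(versions) + ")"
    # Errors seen on more than one deployment type.
    return "all"
```

If only self-managed instances on a narrow set of versions are affected, that points towards a change shipped in those versions rather than a change in a shared service, which narrows down the list of components to check first.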
Edited by Aleksei Lipniagov