Duo Chat runbook
Problem statement
When there is a problem with Duo Chat in production, how do we debug it? Regardless of when the problem comes in, any engineer on the Duo Chat team should be able to identify the root cause and either fix the issue or re-route the incident to the appropriate team.
The root cause of a Duo Chat problem could be any of several things, for example:
- Prompt changes causing poor results
- User does not have access to Duo Chat due to project settings or licensing
- Bug in Cloud Connector license check
- AI Gateway outage
A shared runbook for identifying the root cause also gives us a place to record lessons learned from each incident. For example, if a specific Kibana query was critical in identifying the root cause but was not in the runbook, adding it can be an action item after the incident.
We have some debugging tips in the development docs, but we should build something that is more specific to incidents and production.
Related issue from Cloud Connector: gitlab-org/cloud-connector-team/team-tasks#177 (closed)
Proposal
We should have a structured document with troubleshooting guidance and links to logs (possibly pre-filtered), dashboards, and similar resources, so that we can participate efficiently in incidents that are, or might be, related to Duo Chat.
This runbook should be actively maintained by the team and serve as a place for sharing lessons on how to debug production incidents.
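As a rough illustration of what "structured" could mean here, the sketch below models one runbook entry per suspected root cause, with room for checks, links, and post-incident lessons. The schema, the field names, and the `add_lesson` helper are all hypothetical and are not part of any existing GitLab tooling. The causes are the ones listed in the problem statement, and the checks and links are deliberately left empty for the team to fill in.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    """One suspected root cause and how to check for it (hypothetical schema)."""
    suspected_cause: str
    checks: list[str] = field(default_factory=list)   # what to look at first
    links: list[str] = field(default_factory=list)    # logs, dashboards (left empty here)
    lessons: list[str] = field(default_factory=list)  # added after incidents


# Causes come from the problem statement; everything else is for the team to fill in.
RUNBOOK = [
    RunbookEntry("Prompt changes causing poor results"),
    RunbookEntry("User does not have access due to project settings or licensing"),
    RunbookEntry("Bug in Cloud Connector license check"),
    RunbookEntry("AI Gateway outage"),
]


def add_lesson(runbook, suspected_cause, lesson):
    """Record a post-incident lesson against the matching entry."""
    for entry in runbook:
        if entry.suspected_cause == suspected_cause:
            entry.lessons.append(lesson)
            return entry
    raise KeyError(f"no runbook entry for: {suspected_cause}")
```

After an incident, a follow-up action item could then be as small as one call such as `add_lesson(RUNBOOK, "AI Gateway outage", "note about the query that helped")`, which keeps the lesson next to the cause it relates to.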