Monitoring and Alerting For VertexAPI request failures
Note: This work has been moved to group::ai framework and is being done within these two issues:
- Support for Monitoring and Alerting for Explain... (gitlab-com/gl-infra/scalability#2470 - closed)
- AI clients (Vertex AI, Anthropic, OpenAI etc) s... (#421546 - closed)
In the past, the Explain This Vulnerability feature went offline because of a Vertex API outage. We were unaware of the outage, so the feature was broken for possibly more than a day. We only found out when a developer tried to use it in production.
We want to configure automated alerting so that, as soon as the feature stops successfully contacting the API, we can act and remediate immediately instead of leaving users with a degraded feature for an unknown duration.
The initially proposed strategy was to implement a Kibana Watcher that would monitor the logs and send a message to the #ai_vulnerability_explanation Slack channel. However, @andrewn noted that this is not a desirable pattern because it only provides alerting for GitLab.com and not for self-managed instances, and recommended an alternative approach using Sisense instrumentation.
Instrumentation was recently introduced to track the success rate of Vertex AI interactions: !127753 (merged)
Andrew has also recommended that we wrap instrumentation around the exponential backoff mechanism, so that we track successful interactions more accurately from a user's perspective: !127753 (comment 1507416556)
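To illustrate the recommendation, here is a minimal Ruby sketch of recording one outcome per user-facing request, after the retry loop has finished, rather than one per API attempt. All names in it (`OutcomeCounter`, `with_instrumented_backoff`, the `sleeper` parameter, the locally defined `RateLimitError`) are hypothetical and are not taken from `exponential_backoff.rb` or from !127753; the retry limits and delays are assumptions as well.

```ruby
# Hypothetical sketch; not the code in exponential_backoff.rb.
class RateLimitError < StandardError; end

# Counts final outcomes, one per user-facing request.
class OutcomeCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def increment(outcome)
    @counts[outcome] += 1
  end
end

# Retries the block with exponential backoff and records a single
# :success or :failure only once the retry loop has finished.
# `sleeper` is injectable so the delays can be observed without waiting.
def with_instrumented_backoff(counter, max_retries: 3, base_delay: 1.0,
                              sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    result = yield
    counter.increment(:success)
    result
  rescue RateLimitError
    if attempts <= max_retries
      # Delay doubles on each retry: base_delay, 2x, 4x, ...
      sleeper.call(base_delay * (2**(attempts - 1)))
      retry
    end
    # Retries exhausted: this is the failure the user actually sees.
    counter.increment(:failure)
    raise
  end
end
```

With this shape, a request that is rate limited twice and then succeeds counts as one success, which is closer to what the user experienced than three separate attempt-level data points.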
Once that instrumentation is in place, configuring a Slack alert would effectively resolve this issue. The alert could be either general, covering all Vertex failures and sent to the AI Enablement team, or specific to Explain This Vulnerability and sent to the Threat Insights team.
Support Needed
Configuring Sisense alerts is not a standard task for non-infrastructure/platform teams, so we aren't certain how to proceed from there.
Links
https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/llm/concerns/exponential_backoff.rb
Sentry errors: https://sentry.gitlab.net/gitlab/gitlabcom/?query=ExponentialBackoff
Implementation plan
Configure an alert with Kibana to send Slack notifications to #ai_vulnerability_explanation when a RateLimitError is raised in the context of explain_this_vulnerability requests.
Configure Sisense-based alerting which monitors the success of the AiAction requests, and send an alert to the #ai_framework_team Slack channel accordingly.
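The alert condition for the success-rate monitoring could be expressed roughly as follows. This is only a sketch of the rule in Ruby, not Sisense configuration; the method name `alert?`, the 90% threshold, and the minimum request count are all assumptions chosen for illustration.

```ruby
# Hypothetical alert rule: fire when the success rate over a time window
# drops below a threshold. Windows with very few requests are skipped so
# that a single failed request does not trigger a notification.
def alert?(successes:, failures:, threshold: 0.9, min_requests: 20)
  total = successes + failures
  return false if total < min_requests

  (successes.to_f / total) < threshold
end
```

The minimum request count is a trade-off: it reduces noise, but for a low-traffic feature it may need to be small, or the window long, so that a full outage is still detected.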