Monitoring and Alerting for Vertex API request failures

Note: This work has been moved to the AI Framework group and is being done within these two issues:

Support for Monitoring and Alerting for Explain... (gitlab-com/gl-infra/scalability#2470 - closed)
AI clients (Vertex AI, Anthropic, OpenAI etc) s... (#421546 - closed)

In the past, the Explain This Vulnerability feature went offline because of a Vertex API outage. We were unaware of the outage, so the feature was broken for possibly over a day before we found out, and then only because a developer happened to try it in production.

We want to configure automated alerting so that as soon as the feature stops successfully contacting the API, we can act and remediate immediately, rather than leaving users with a degraded feature for an unknown duration.

The initially proposed strategy was to implement a Kibana Watcher that would monitor the logs and send a Slack notification to the #ai_vulnerability_explanation channel. However, @andrewn pointed out that this is not a desirable pattern, because it only provides alerting for GitLab.com and not for self-managed instances. He recommended an alternative approach using Sisense instrumentation.

Instrumentation was recently introduced to track the success rate of Vertex AI interactions: !127753 (merged)

Andrew has also recommended that we wrap instrumentation around the exponential backoff mechanism, so that we track successful interactions more accurately from the user's perspective: !127753 (comment 1507416556)
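To illustrate the idea of instrumenting around the backoff rather than around each individual request, here is a minimal Ruby sketch. It is not the code from `exponential_backoff.rb`: the `OutcomeCounter` class, the `with_instrumented_backoff` method, its parameters, and the locally defined `RateLimitError` are all hypothetical stand-ins. The point it tries to show is that one outcome is recorded per user-facing call, after all retries have resolved, so a request that succeeds on the third attempt counts as one success and not as two failures plus a success.

```ruby
# Hypothetical sketch: these names are illustrative, not GitLab's actual code.
class RateLimitError < StandardError; end

# Stand-in for a real metrics client; keeps outcome counts in memory.
class OutcomeCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def increment(outcome)
    @counts[outcome] += 1
  end
end

# Runs the block, retrying on RateLimitError with exponentially growing
# delays. Exactly one outcome is recorded per call: :success if any attempt
# succeeds, :failure only once the retries are exhausted.
def with_instrumented_backoff(counter, max_retries: 3, base_delay: 1.0,
                              sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    result = yield
    counter.increment(:success)
    result
  rescue RateLimitError
    if attempts <= max_retries
      # Delay doubles each retry: base_delay, 2 * base_delay, 4 * base_delay...
      sleeper.call(base_delay * (2**(attempts - 1)))
      retry
    end
    counter.increment(:failure)
    raise
  end
end
```

The `sleeper` argument exists only so the delays can be stubbed out in tests; a real implementation would more likely report to the existing instrumentation than to an in-memory counter.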

Once that instrumentation is in place, configuring a Slack alert would effectively resolve this issue. The alert could either be general, covering all Vertex failures and going to the AI Enablement team, or specific to Explain This Vulnerability and going to the Threat Insights team.

Support Needed

Configuring Sisense alerts is not a standard task for teams outside infrastructure/platform, so we are not certain how to proceed from there.

Links

https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/llm/concerns/exponential_backoff.rb

Sentry errors: https://sentry.gitlab.net/gitlab/gitlabcom/?query=ExponentialBackoff

Implementation plan

~~Configure an alert with Kibana to send Slack notifications to #ai_vulnerability_explanation when a RateLimitError is raised in the context of explain_this_vulnerability requests.~~

Configure Sisense-based alerting that monitors the success of the AiAction requests and sends an alert to the #ai_framework_team Slack channel accordingly.
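The alert condition itself is not specified in this issue. As a rough sketch of what "monitors the success" could mean, the Ruby snippet below expresses one possible rule: alert when the success rate over some window drops below a threshold, but only once enough requests have been seen to make the rate meaningful. The method name, the 95% threshold, and the minimum of 20 requests are assumptions chosen for illustration; the actual alert would be configured in Sisense, not in application code.

```ruby
# Hypothetical alert rule; the threshold and minimum sample size are assumptions.
def alert_on_success_rate?(successes:, failures:, min_success_rate: 0.95, min_requests: 20)
  total = successes + failures
  # Too few requests in the window to judge the success rate reliably.
  return false if total < min_requests

  (successes.to_f / total) < min_success_rate
end
```

A minimum request count keeps a single failed request during a quiet period from triggering an alert on its own.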

Edited Aug 18, 2023 by Neil McCorrison