Implement error rate alerts for each Code Review AI feature
Context
When the error rate increases for Code Review AI features, we (group::code review) are not alerted.
In this issue, we should work out how to detect that and implement alerts that go to the #g_create_code-review_alerts channel.
Ideas
It would be helpful if we could alert per completion class, since those are tagged by feature category. For example, if Llm::CompletionWorker is processing a specific completion class and the error rate for that class exceeds its SLO threshold, an alert is created for the appropriate feature category.
The existing alert for Llm::CompletionWorker is tagged under the ai_abstraction_layer feature category and is posted in #feeds_alerts_general. Example:
Alert Firing: The llm_completion SLI of the sidekiq service (main stage) has an error rate violating SLO
This SLI signifies operations that reach out to a language model with a prompt. These interactions with an AI provider are executed within Llm::CompletionWorker jobs. The worker could execute multiple requests to an AI provider for a single operation.
A success means that we were able to present the user with a response that is delivered to a client that is subscribed to a websocket. An error could be that the AI-provider is not responding, or is erroring.
For the apdex, we consider an operation fast enough if we were able to get a complete response from the AI provider within 20 seconds. This does not include the time it took for the Sidekiq job to get picked up, or the time it took to deliver the response to the client.
The service_class label on the source metrics tells us which AI related features the operation is for.
These operations do not go through the API gateway yet, but will in the future.
Currently the error-rate is 51.63%.
alertname: SidekiqServiceLlmCompletionErrorSLOViolation
aggregation: component
alert_class: slo_violation
alert_type: symptom
component: llm_completion
environment: gprd
feature_category: ai_abstraction_layer
severity: s4
sli_type: error
stage: main
type: sidekiq
user_impacting: yes
window: 6h
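To make the per-completion-class idea concrete, here is a minimal Python sketch of the routing logic. It is an illustration only, not how the existing alerting is implemented. The completion class name `ExampleCodeReviewCompletion`, the `code_review_workflow` category assignment, and the 5% threshold are assumptions made for the example; the two channel names and the `ai_abstraction_layer` fallback come from this issue.

```python
from collections import Counter

# Hypothetical mapping from completion class (the service_class label)
# to feature category. The class name is a placeholder, not a real class.
CATEGORY_BY_SERVICE_CLASS = {
    "ExampleCodeReviewCompletion": "code_review_workflow",
}
# Fallback mirrors the category of the existing Llm::CompletionWorker alert.
DEFAULT_CATEGORY = "ai_abstraction_layer"

# Channels named in this issue, keyed by feature category.
CHANNEL_BY_CATEGORY = {
    "code_review_workflow": "#g_create_code-review_alerts",
    "ai_abstraction_layer": "#feeds_alerts_general",
}

# Assumed threshold for illustration; the real SLO is defined elsewhere.
ERROR_RATE_THRESHOLD = 0.05


def error_rates(operations):
    """Return the error rate per service_class.

    `operations` is an iterable of (service_class, succeeded) pairs.
    """
    totals = Counter()
    errors = Counter()
    for service_class, succeeded in operations:
        totals[service_class] += 1
        if not succeeded:
            errors[service_class] += 1
    return {sc: errors[sc] / totals[sc] for sc in totals}


def alerts_for(operations, threshold=ERROR_RATE_THRESHOLD):
    """Build one alert per service_class whose error rate exceeds the threshold."""
    alerts = []
    for service_class, rate in sorted(error_rates(operations).items()):
        if rate > threshold:
            category = CATEGORY_BY_SERVICE_CLASS.get(service_class, DEFAULT_CATEGORY)
            alerts.append({
                "service_class": service_class,
                "feature_category": category,
                "channel": CHANNEL_BY_CATEGORY[category],
                "error_rate": rate,
            })
    return alerts


if __name__ == "__main__":
    sample = (
        [("ExampleCodeReviewCompletion", False)] * 3
        + [("ExampleCodeReviewCompletion", True)]
        + [("OtherCompletion", True)] * 4
    )
    for alert in alerts_for(sample):
        print(alert)
```

In this sketch only the code review class breaches the threshold, so a single alert is routed to #g_create_code-review_alerts, while classes without an explicit mapping would keep going to #feeds_alerts_general.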