Implement error rate alerts for each Code Review AI feature

Context

When the error rate increases for Code Review AI features, we (group::code review) are not alerted.

In this issue, we should work out how to detect that and implement alerts that are sent to the #g_create_code-review_alerts channel.

Ideas

It would be helpful if we could alert per completion class, since those are tagged by feature category. For example, if Llm::CompletionWorker is processing a specific completion class and the error rate for that class exceeds a set threshold, an alert would be created for the appropriate feature category.
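
A minimal sketch of the per-completion-class idea, not an actual implementation: it groups job counts by feature category and reports the categories whose error rate exceeds a threshold. The class names, the class-to-category mapping, and the 5% threshold are all hypothetical placeholders; real alerting would most likely be defined in the monitoring configuration rather than in application code.

```python
from collections import defaultdict

# Hypothetical mapping from completion class to feature category.
# Neither the class names nor the mapping come from the GitLab codebase.
COMPLETION_CLASS_CATEGORY = {
    "ExampleReviewCompletion": "code_review_workflow",
    "ExampleChatCompletion": "duo_chat",
}

# Category used when a completion class has no explicit mapping,
# mirroring how the existing alert is tagged.
FALLBACK_CATEGORY = "ai_abstraction_layer"

# Hypothetical threshold: alert when more than 5% of operations fail.
ERROR_RATE_THRESHOLD = 0.05


def error_rates_by_category(samples):
    """Aggregate (completion_class, total, errors) tuples per feature category."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for completion_class, total, failed in samples:
        category = COMPLETION_CLASS_CATEGORY.get(completion_class, FALLBACK_CATEGORY)
        totals[category] += total
        errors[category] += failed
    # Skip categories with no traffic to avoid dividing by zero.
    return {c: errors[c] / totals[c] for c in totals if totals[c] > 0}


def categories_to_alert(samples, threshold=ERROR_RATE_THRESHOLD):
    """Return the feature categories whose error rate exceeds the threshold."""
    rates = error_rates_by_category(samples)
    return sorted(c for c, rate in rates.items() if rate > threshold)
```

With this shape, a spike in one completion class only raises an alert for its own feature category, while unmapped classes still fall back to ai_abstraction_layer.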

The existing alert for Llm::CompletionWorker is tagged with the ai_abstraction_layer feature category and is posted in #feeds_alerts_general. Example:

Alert Firing: The llm_completion SLI of the sidekiq service (main stage) has an error rate violating SLO

This signifies operations that reach out to a language model with a prompt. These interactions with an AI provider are executed within Llm::CompletionWorker jobs. The worker could execute multiple requests to an AI provider for a single operation.

A success means that we were able to present the user with a response that is delivered to a client that is subscribed to a websocket. An error could be that the AI-provider is not responding, or is erroring.

For the apdex, we consider an operation fast enough if we were able to get a complete response from the AI provider within 20 seconds. This does not include the time it took for the Sidekiq job to get picked up, or the time it took to deliver the response to the client.

The service_class label on the source metrics tells us which AI-related feature the operation is for.

These operations do not go through the API gateway yet, but will in the future.

Currently the error-rate is 51.63%.

alertname: SidekiqServiceLlmCompletionErrorSLOViolation
aggregation: component
alert_class: slo_violation
alert_type: symptom
component: llm_completion
environment: gprd
feature_category: ai_abstraction_layer
severity: s4
sli_type: error
stage: main
type: sidekiq
user_impacting: yes
window: 6h
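
To illustrate how the feature_category label could decide where an alert lands, here is a small sketch that picks a channel from an alert's labels. The `code_review_workflow` category value and the routing table are assumptions for illustration; actual routing would be configured in the alerting stack, not in a script like this.

```python
# Hypothetical routing table keyed by the feature_category label.
# "code_review_workflow" is an assumed label value for Code Review features.
CHANNEL_BY_FEATURE_CATEGORY = {
    "code_review_workflow": "#g_create_code-review_alerts",
}

# Channel where the existing Llm::CompletionWorker alert is posted.
DEFAULT_CHANNEL = "#feeds_alerts_general"


def channel_for_alert(labels):
    """Pick a channel from the alert labels, falling back to the general feed."""
    category = labels.get("feature_category")
    return CHANNEL_BY_FEATURE_CATEGORY.get(category, DEFAULT_CHANNEL)


# Labels taken from the example alert above.
example_labels = {
    "alertname": "SidekiqServiceLlmCompletionErrorSLOViolation",
    "component": "llm_completion",
    "feature_category": "ai_abstraction_layer",
    "severity": "s4",
    "sli_type": "error",
}
```

Under these assumptions, the example alert stays in the general feed, and only alerts labelled with the Code Review category reach the group channel.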
Edited Aug 18, 2025 by 🤖 GitLab Bot 🤖