Currently, the score on the group::duo chat Error Budget dashboard is below 60.0, which means our feature is not performing well. We should identify the root cause and fix it.
According to the Rails Requests Apdex, requests to POST /api/:version/chat/completions are slower than the default threshold (< 1 sec).
Proposal
This endpoint is used for evaluations by the group::ai model validation group, so this is not user-facing latency. We should set it to urgency :low (< 5 sec) or ignore it completely.
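For reference, a minimal sketch of what the urgency override could look like in the Grape API class, assuming the usual GitLab DSL (the class, namespace, and route names below are illustrative, not the real ones):

```ruby
# Illustrative sketch only -- the real API class and file for this endpoint
# may be named differently.
module API
  class ChatCompletions < ::API::Base
    feature_category :duo_chat

    # Lower the request Apdex target for every route in this class from the
    # default (< 1 s) to low urgency (< 5 s).
    urgency :low

    namespace 'chat' do
      post 'completions' do
        # ... synchronous LLM call ...
      end
    end
  end
end
```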
Over the weekend there were 30 calls. Even though we have set the urgency to low, the Apdex score was still affected (down to 78.9%) because all /api/:version/chat/completions calls took between 5 and 10 seconds.
Currently, this endpoint calls the LLM synchronously, which means it is not always possible to reduce the duration below our current goal of 5 seconds. For example, any question that requires 2 sequential LLM steps will go over that limit.
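As a rough illustration of why this happens (method names and timings below are made up, not the actual implementation):

```ruby
# Hypothetical sketch: names and latencies are illustrative only.
def llm_step(label, seconds)
  sleep(seconds) # stands in for one synchronous LLM round trip
  "#{label} done"
end

def chat_completion(question)
  plan = llm_step("plan for #{question}", 3)  # step 1: pick tools / build a plan
  llm_step("answer from #{plan}", 3)          # step 2: generate the final answer
end

# Two sequential ~3 s round trips already take ~6 s, above the 5 s
# low-urgency target, before any Rails overhead is added.
```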
The short-term alternatives are:

- Ask the model validation team to use the GraphQL endpoint instead
- Set its feature_category to "not_owned" (this is a hack; see the sketch below)

The long-term solution could be:

- Make this endpoint work asynchronously
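For completeness, the "not_owned" hack would look roughly like this (illustrative class name again). It only removes the endpoint's requests from the duo_chat error budget; the latency itself does not improve:

```ruby
# Illustrative sketch of the hack: re-categorise the endpoint so its requests
# stop counting against the duo_chat error budget. We simply stop measuring
# it under our group, which is why it is a hack rather than a fix.
module API
  class ChatCompletions < ::API::Base
    feature_category :not_owned
  end
end
```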
Another cause of the low error budget score is that we have too few measurements.
Currently we have around 30 measurements daily, which is far fewer than the number of Duo Chat requests we actually serve. With so few data points, a single slow internal request to /api/:version/chat/completions can skew the score by roughly 3 percentage points.
Most of the real user requests go through Llm::CompletionWorker, which is classified as ai_abstraction_layer because it serves all AI requests. We are probably missing most of the good measurements there.
Instead, there is a separate SLI called llm_completion. It is properly categorized as duo_chat. I think we should track that instead of the generic Apdex score? Thoughts?
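To make the difference concrete, here is a simplified sketch of the two measurement paths, assuming the standard ApplicationWorker and Gitlab::Metrics::Sli patterns (the label set and threshold below are assumptions, and the real worker is considerably more involved):

```ruby
# Simplified, illustrative sketch -- not the actual worker code.
module Llm
  class CompletionWorker
    include ApplicationWorker

    # Generic Sidekiq/request SLIs attribute this work to ai_abstraction_layer,
    # because the worker serves every AI feature, not just Duo Chat.
    feature_category :ai_abstraction_layer

    def perform(*)
      started_at = ::Gitlab::Metrics::System.monotonic_time
      # ... run the completion ...
      duration = ::Gitlab::Metrics::System.monotonic_time - started_at

      # The dedicated llm_completion application SLI can carry duo_chat as its
      # feature_category label, so tracking it would reflect real Duo Chat
      # traffic much better than the generic request Apdex does.
      ::Gitlab::Metrics::Sli::Apdex[:llm_completion].increment(
        labels: { feature_category: :duo_chat }, # label set is an assumption
        success: duration <= 30                  # threshold is illustrative
      )
    end
  end
end
```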
> Ask the model validation team to use the GraphQL endpoint instead
I think this should be the long-term solution. Currently it's a burden for us to maintain both GraphQL and REST. More importantly, evaluations should use exactly the same flow as production; otherwise we can't fully trust that evaluation performance reflects production performance.
> Instead, there is a separate SLI called llm_completion. It is properly categorized as duo_chat. I think we should track that instead of the generic Apdex score? Thoughts?
In this case, maybe we don't need to stick strictly to the error budget dashboard, but can introduce a new Grafana dashboard instead. FYI, there are a few chat-related SLIs in this AI Gateway dashboard, but we should have a dedicated Duo Chat dashboard covering the entire service, including both GitLab and AI Gateway.