AI Framework - Error budget improvements
Problem Statement
Following the investigation in https://gitlab.com/gitlab-org/ai-powered/ai-framework/team-hq/-/issues/1#note_1576939796, we want to explore options for addressing our error budget issues.
55 out of 80 LLM Workers [fail due to ](https://log.gprd.gitlab.net/app/visualize#/create?type=table&indexPattern=AWNABDRwNDuQHTm2tH6l&_a=\(filters:!\(\('$state':\(store:appState\),meta:\(alias:!n,disabled:!f,index:AWNABDRwNDuQHTm2tH6l,key:json.meta.feature_category,negate:!f,params:!\(ai_abstraction_layer\),type:phrases,value:!\(ai_abstraction_layer\)\),query:\(bool:\(minimum_should_match:1,should:!\(\(match_phrase:\(json.meta.feature_category:ai_abstraction_layer\)\)\)\)\)\),\('$state':\(store:appState\),meta:\(alias:!n,disabled:!f,index:AWNABDRwNDuQHTm2tH6l,key:json.job_status,negate:!f,params:\(query:fail\),type:phrase\),query:\(match_phrase:\(json.job_status:\(query:fail\)\)\)\),\('$state':\(store:appState\),meta:\(alias:!n,disabled:!f,field:json.exception.class,index:AWNABDRwNDuQHTm2tH6l,key:json.exception.class,negate:!f,params:\(query:'Net::ReadTimeout'\),type:phrase\),query:\(match_phrase:\(json.exception.class:'Net::ReadTimeout'\)\)\)\),linked:!f,query:\(language:kuery,query:''\),uiState:\(\),vis:\(aggs:!\(\(enabled:!t,id:'1',params:\(emptyAsNull:!f\),schema:metric,type:count\),\(enabled:!t,id:'2',params:\(excludeIsRegex:!t,field:json.class.keyword,includeIsRegex:!t,missingBucket:!f,missingBucketLabel:Missing,order:desc,orderBy:'1',otherBucket:!t,otherBucketLabel:Other,size:5\),schema:bucket,type:terms\)\),params:\(autoFitRowToContent:!f,perPage:10,percentageCol:'',showMetricsAtAllLevels:!f,showPartialRows:!f,showToolbar:!f,showTotal:!f,totalFunc:sum\),title:'',type:table\)\)&_g=\(time:\(from:'now-7d',to:'now'\)\))`Net::ReadTimeout`.
Here is a link to a single captured error: https://log.gprd.gitlab.net/app/discover#/doc/AWNABDRwNDuQHTm2tH6l/pubsub-sidekiq-inf-gprd-003640?id=toEKvIoBpdGZRdmwIALU
The job started at `Sep 22, 2023 @ 08:37:44.510` and errored at `Sep 22, 2023 @ 08:38:14.420`, 29.91 seconds later. That is very close to 30 seconds and would align with the `write_timeout` we specify as default in [Gitlab::HTTP](https://gitlab.com/gitlab-org/gitlab/-/blob/6903c195035ec453dbe3d2d6af35a4dcae676584/lib/gitlab/http.rb#L32).
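As a quick sanity check of the gap between the two timestamps, the elapsed time can be computed directly. This is only a worked calculation on the values quoted above; the time zone is assumed to be UTC, which does not affect the difference.

```ruby
# Timestamps copied from the captured log entry.
# The zone is an assumption (UTC); only the difference matters here.
started_at = Time.utc(2023, 9, 22, 8, 37, 44, 510_000) # last argument is microseconds
errored_at = Time.utc(2023, 9, 22, 8, 38, 14, 420_000)

# Subtracting two Time objects yields the difference in seconds as a Float.
elapsed = (errored_at - started_at).round(3)

puts "Elapsed: #{elapsed} seconds"
```

The result is just under the 30-second mark, consistent with a timeout of that size firing.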
Why a "write timeout" when we don't write anything? We're using a `POST` request on the Anthropic client. We should take a closer look, since we also track the durations of these HTTP requests, but I suspect we need to increase the timeout.
We can increase `MAX_RUN_TIME` to 30 seconds to match the maximum HTTP response time. However, completing a chat could take *multiple* HTTP requests, each of which is allowed to take up to 30 seconds.
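To make the gap between the per-request limit and the whole-job limit concrete, here is a minimal sketch of the worst case. It assumes the requests run one after another and that each may use its full timeout; `worst_case_seconds` and `HTTP_TIMEOUT_SECONDS` are hypothetical names for illustration, not existing GitLab code.

```ruby
# Per-request ceiling discussed above (illustrative constant, not the real one).
HTTP_TIMEOUT_SECONDS = 30

# Hypothetical helper: upper bound on job duration when a chat needs
# `request_count` sequential HTTP requests that may each hit the timeout.
def worst_case_seconds(request_count, per_request_timeout: HTTP_TIMEOUT_SECONDS)
  request_count * per_request_timeout
end

(1..3).each do |n|
  puts "#{n} request(s): up to #{worst_case_seconds(n)} seconds"
end
```

Under these assumptions, any chat needing more than one request can exceed a 30-second `MAX_RUN_TIME` even when every individual request stays within its own limit.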
Waiting more than 20 or 30 seconds for an answer is not acceptable for the user. But total duration does not tell the whole story: the user might wait 10 seconds until the streamed response starts, and the stream itself might then take another 15 seconds. The user experience is then still okay-ish.
TL;DR:
* Let's bump `MAX_RUN_TIME=30.seconds` to match max HTTP request time
* As a follow-up: our apdex should not be about "time till full completion". For a streamed response, it should be "time till first byte".
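The difference between the two measurements can be sketched as follows. This is a rough illustration, not how the worker currently records durations: `measure_stream` and `StreamTiming` are hypothetical names, the stream is modelled as any enumerable of chunks, and the clock is injectable so the example can replay the 10-second and 15-second figures from above without actually waiting.

```ruby
# Hypothetical value object holding both measurements, in seconds.
StreamTiming = Struct.new(:time_to_first_chunk, :total_duration)

# Hypothetical helper: consumes a stream of chunks and records when the
# first chunk arrived and when the stream finished.
def measure_stream(chunks, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started_at = clock.call
  first_chunk_at = nil

  chunks.each do |chunk|
    # Only the first chunk sets this; later chunks leave it untouched.
    first_chunk_at ||= clock.call
    yield chunk if block_given?
  end

  finished_at = clock.call
  time_to_first = first_chunk_at && (first_chunk_at - started_at)
  StreamTiming.new(time_to_first, finished_at - started_at)
end

# Fake clock replaying the scenario above: the request starts at 0 s,
# the first chunk arrives at 10 s, and the stream ends at 25 s.
ticks = [0.0, 10.0, 25.0].each
timing = measure_stream(%w[Hel lo], clock: -> { ticks.next })

puts "time to first chunk: #{timing.time_to_first_chunk} s"
puts "total duration:      #{timing.total_duration} s"
```

In this scenario an apdex based on total duration would count the request as a 25-second wait, while one based on the first chunk would count 10 seconds, which is closer to what the user perceives.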
Exit Criteria
- Increase `MAX_RUN_TIME`: Adjust `MAX_RUN_TIME` from the current value of 20 seconds to 30 seconds to align with the maximum HTTP request time.
- Reevaluate SLAs: Review and recalculate the SLAs for the worker based on the updated `MAX_RUN_TIME` to ensure accuracy in our service level commitments.