Corrective action: The shard_imports SLI of the sidekiq service (main stage) has an apdex violating SLO
## Summary

The import shard for handling import sidekiq work has its HPA auto-scaler disabled. There are several historical reasons for this, but in some cases adding a few more pods can help clear bottlenecks when many large, long-running imports are clogging the sidekiq queue. The inability to scale dynamically is also an anti-pattern.

This [MR](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/2210) has some good recent conversation about the approaches currently being considered to fix this and what their impacts may be.

In https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/2210 we introduced auto-scaling. This helps a bit; however, due to the burstiness of imports, we often scale to the maximum number of replicas while still alerting.

CPU and/or memory usage are not a good indication of the work that needs to be done for this workload, so they are poor signals for driving auto-scaling. For that we need to bring external metrics to the HPA: something that indicates the number of jobs pending or how old they are (like we do for pubsubbeat).

Challenge number 1: we need a Prometheus adapter to communicate with the HPA (helmfile release [here](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/a809e39de1d9b57bf5522d7f4946ac6918b2ccdb/releases/prometheus-adapter/helmfile.yaml)). We can only have a single custom metric adapter per cluster, so we have to point it to Thanos (and solve any issues that arise from that).

Challenge number 2: build a metric/query that represents an indication of work to be done, based on the number of pending jobs and how old they are.

An alternative to the above work would be to tune the existing alert and accept the fact that when we get hundreds of imports, there will be a delay.

[More historical context](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/13582)

## Related Incident(s)

<!-- Note the originating incident(s) and link known related incidents/other issues.
The relation will happen automatically if you are creating this issue from an incident, if this isn't done already please uncomment the following line: -->

Originating issue(s):

- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7903
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8173

## Desired Outcome/Acceptance Criteria

This is an investigation action. Can we find a way to allow scaling with the auto-scaler for the import shard without causing larger problems with deploys or stopping pods that are handling a long-running (possibly 1.5 to 2 hours) import?

## Associated Services

<!-- Apply the appropriate services associated with this corrective action if applicable. ~Service::SERVICE_NAME -->

## Corrective Action Issue Checklist

* [x] Link the incident(s) this corrective action arose out of
* [x] Give context for what problem this corrective action is trying to prevent from re-occurring
* [x] Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] Assign a priority (this will default to 'Reliability::P4')
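
## Illustrative sketch for challenge number 2

As a rough aid for the investigation, the snippet below sketches one possible shape for the "work to be done" signal described in challenge number 2 of the Summary: pending jobs, weighted up once the oldest pending job is older than a target age. Everything in it is an assumption made for illustration (the names `QueueSnapshot`, `work_signal`, `desired_replicas`, the 300-second age target, the 2 jobs per pod, and the replica bounds); none of it comes from the existing configuration. The replica calculation only approximates how an HPA treats an external metric with an average-value target (metric value divided by the per-pod target, rounded up, then clamped).

```python
import math
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of the import queue at one point in time."""
    pending_jobs: int         # jobs waiting in the import queue
    oldest_job_age_s: float   # age of the oldest pending job, in seconds


def work_signal(snapshot: QueueSnapshot, age_target_s: float = 300.0) -> float:
    """Assumed metric: pending jobs, scaled up once the oldest job exceeds the age target."""
    age_factor = max(1.0, snapshot.oldest_job_age_s / age_target_s)
    return snapshot.pending_jobs * age_factor


def desired_replicas(
    snapshot: QueueSnapshot,
    jobs_per_pod: float = 2.0,   # assumed per-pod target for the signal
    min_replicas: int = 1,       # assumed lower bound
    max_replicas: int = 20,      # assumed upper bound
) -> int:
    """Replicas = ceil(signal / per-pod target), clamped to the configured bounds."""
    wanted = math.ceil(work_signal(snapshot) / jobs_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for snap in (QueueSnapshot(10, 60), QueueSnapshot(10, 600), QueueSnapshot(500, 3000)):
        print(snap, "->", desired_replicas(snap))
```

This sketch only models scale-up pressure. It does not address the acceptance criterion about not stopping pods that are in the middle of a long-running import, which would need separate handling on scale-down.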