Corrective action: The shard_imports SLI of the sidekiq service (main stage) has an apdex violating SLO
Summary
The import shard, which handles import sidekiq work, has its HPA auto-scaler disabled. There are several historical reasons for this, but in some cases adding a few more pods can help clear bottlenecks when many large, long-running imports are clogging the sidekiq queue. The inability to scale dynamically is also an anti-pattern.
This MR has some good recent discussion of the approaches currently being considered to fix this and their potential impacts.
In gitlab-com/gl-infra/k8s-workloads/gitlab-com!2210 (merged) we introduced auto-scaling. This helps a bit; however, due to the burstiness of imports, we often scale to the maximum number of replicas while still alerting.
CPU and/or memory usage is not a good indicator of the outstanding work for this workload, so it cannot drive auto-scaling on its own. For that we need to bring external metrics to the HPA: something that indicates the number of pending jobs or how old they are (as we do for pubsubbeat).
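As an illustration, an HPA driven by an external metric could look like the sketch below. This is a hypothetical fragment, not the actual release config: the metric name `sidekiq_queue_pending_jobs`, the deployment name, and the target value are all illustrative, and it assumes a metrics adapter (challenge 1 below) is already serving the metric.

```yaml
# Hypothetical sketch: an HPA scaling the import shard on an external
# queue metric instead of CPU/memory. Names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-imports   # hypothetical release/deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-imports
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq_queue_pending_jobs   # served by the metrics adapter
          selector:
            matchLabels:
              queue: imports
        target:
          type: AverageValue
          averageValue: "50"   # e.g. scale up above ~50 pending jobs per pod
```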
That represents challenge number 1: we need a Prometheus adapter to communicate with the HPA (helmfile release here). We can only have a single custom metric adapter per cluster, so we have to point it at Thanos (and solve any issues that arise from that).
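For context, the prometheus-adapter exposes Prometheus/Thanos queries to the HPA's external metrics API via rules along these lines. This is a hedged sketch: the series name `sidekiq_queue_size`, the label set, and the exposed metric name are assumptions, not our actual metrics.

```yaml
# Hypothetical prometheus-adapter config fragment exposing a Sidekiq queue
# metric as an external metric. Series and metric names are illustrative.
externalRules:
  - seriesQuery: 'sidekiq_queue_size{queue="imports"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "sidekiq_queue_pending_jobs"
    metricsQuery: 'sum(sidekiq_queue_size{queue="imports"})'
```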
Challenge number 2: build a metric/query that represents the amount of work to be done, based on the number of pending jobs and how old they are.
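One possible shape for such a query, assuming hypothetical metrics `sidekiq_queue_size` (queue depth) and `sidekiq_queue_latency_seconds` (age of the oldest pending job) exist, would weight depth by age so that a small but stale queue still triggers scaling:

```promql
# Hypothetical "pending work" score: queue depth weighted by how long
# jobs have been waiting. Metric names are illustrative, not confirmed.
sum by (queue) (sidekiq_queue_size{queue="imports"})
  * on (queue)
clamp_min(sidekiq_queue_latency_seconds{queue="imports"} / 60, 1)
```

The `clamp_min(..., 1)` keeps the age factor from shrinking the score below the raw queue depth when jobs are fresh.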
An alternative to the above work would be to tune the existing alert and accept that when we receive hundreds of imports, there will be a delay.
Related Incident(s)
Originating issue(s):
Desired Outcome/Acceptance Criteria
This is an investigation action: can we find a way to allow the auto-scaler to scale the import shard without causing larger problems with deploys, or stopping pods that are handling a long-running (possibly 1.5 to 2 hour) import?
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')