Infra investigation followup: sidekiq_queueing SLI of the sidekiq service on shard urgent-other has an apdex violating SLO
Summary
On 2023-07-28, 2023-07-30 and 2023-07-31 we received 10 PD pages for a similar issue on Sidekiq (see https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24188).
So far we have documented this in two incidents: production#16096 (closed) and production#16135 (closed).
@pguinoiseau commented in production#16096 (comment 1502304647):
There was a recurrence today: production#16135 (closed). With KEDA's cron trigger we could scale up the HPA a few minutes before those hourly scheduled pipelines, which could help avoid this.
I would like to use this issue to collect next steps and see whether we can avoid getting paged for this moving forward.
Some ideas off the top of my head:
- Are the scheduling latency spikes we are seeing acceptable to users?
- Do we want to isolate the workers that see spikes every hour into their own shard?
- Should we over-provision and tune the HPA for now so we don't get paged?
- Do we want to try prioritizing something like KEDA to automatically scale up at the top of the hour?
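
To make the KEDA idea above more concrete, here is a rough sketch of what a cron-triggered pre-scale could look like. It is only an illustration under assumptions: the object name, namespace, deployment name, replica counts and the 5-minute lead / 10-minute hold window are all hypothetical placeholders, not values from our current configuration. The script builds a KEDA `ScaledObject` manifest with a `cron` trigger and prints it as JSON, which `kubectl apply -f -` accepts.

```python
import json


def hourly_prescale_scaled_object(
    name,
    namespace,
    target_deployment,
    desired_replicas,
    min_replicas,
    max_replicas,
    lead_minutes=5,
    hold_minutes=10,
    timezone="Etc/UTC",
):
    """Build a KEDA ScaledObject that pre-scales around the top of every hour.

    All arguments are illustrative; real values would come from the shard's
    actual deployment and capacity planning.
    """
    if not 0 < lead_minutes < 60:
        raise ValueError("lead_minutes must be between 1 and 59")
    if not 0 < hold_minutes < 60 - lead_minutes:
        raise ValueError("hold window must end before the next window starts")

    # The window opens `lead_minutes` before the hour and closes
    # `hold_minutes` after it, e.g. from xx:55 until xx:10.
    start_minute = 60 - lead_minutes
    end_minute = hold_minutes

    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": target_deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "cron",
                    "metadata": {
                        "timezone": timezone,
                        "start": f"{start_minute} * * * *",
                        "end": f"{end_minute} * * * *",
                        # KEDA trigger metadata values are strings.
                        "desiredReplicas": str(desired_replicas),
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and numbers, for illustration only.
    manifest = hourly_prescale_scaled_object(
        name="sidekiq-urgent-other-prescale",
        namespace="gitlab",
        target_deployment="gitlab-sidekiq-urgent-other",
        desired_replicas=40,
        min_replicas=10,
        max_replicas=60,
    )
    print(json.dumps(manifest, indent=2))
```

As far as I understand, KEDA creates and manages its own HPA for the target, so this would replace a hand-written HPA rather than sit alongside it, and the existing utilization-based scaling would need to be expressed as an additional trigger in the same `ScaledObject`.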
cc @pguinoiseau @gsgl @anganga @ahanselka who were involved with prior incidents.
cc @stejacks-gitlab from Scalability who has been looking at Sidekiq recently.