Occasional latency spikes for sidekiq authorized_projects queue
We've observed occasional spikes in queue time for the authorized_projects job. This is in the realtime queue, and should be low-latency.
A recent example: see https://dashboards.gitlab.net/d/sidekiq-priority-detail/sidekiq-priority-detail?orgId=1&from=1577684723852&to=1577712583793&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-priority=realtime and https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=authorized_projects&from=1577689416442&to=1577720974571.
Above, you can see a surge in queue time lasting several minutes. It coincided with a dramatic increase in QPS, which then plateaued, and with saturation of sidekiq workers, but not of CPU:
This points to nodes being (slightly) underutilised. I've noticed we have a 2:1 ratio of cores to sidekiq processes on realtime nodes, unlike our other priorities, which have 1:1. Am I right to assume that's intentional, to buy us plenty of headroom for a latency-sensitive queue? If so, we could try scaling out the realtime pool.
One possibly contradictory point to note: back in the first chart, execution time per job rises and then plateaus, and SQL timings do the same:
However, unlike recent issues with post_receive (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8824), we do not see QPS drop as slow jobs clog the queues.
My instinct is to try scaling out the worker pool and observe any change in frequency of these spikes after a week.
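As a rough sanity check on how far to scale, here is a back-of-the-envelope sketch using Little's law (busy workers ≈ arrival rate × mean execution time). The function names, the 70% target utilisation, and the example numbers are all hypothetical and not taken from our dashboards; real QPS and per-job execution times would need to be plugged in.

```python
import math


def worker_saturation(qps: float, mean_exec_seconds: float, workers: int) -> float:
    """Fraction of worker threads busy on average (Little's law).

    A value at or above 1.0 means jobs arrive faster than the pool can
    run them, so queue time grows.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    return qps * mean_exec_seconds / workers


def required_workers(qps: float, mean_exec_seconds: float,
                     target_utilisation: float = 0.7) -> int:
    """Workers needed to keep average utilisation at or below the target.

    target_utilisation is an assumed headroom figure, not a measured one.
    """
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    busy_workers = qps * mean_exec_seconds
    return math.ceil(busy_workers / target_utilisation)


if __name__ == "__main__":
    # Hypothetical spike: 50 jobs/s at 0.4 s per job.
    print(worker_saturation(50, 0.4, workers=20))
    print(required_workers(50, 0.4))
```

Note that if execution time per job rises during a spike (as the SQL timings suggest), the required pool size rises with it, so the peak execution time is the safer input.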
I'm not sure if this is ~"team::Scalability" or not, but I'll label it as such for now (@andrewn please correct me if I'm wrong).
cc @smcgivern, with whom I spoke about this on Slack.
cc @hphilipps @devin @ahanselka, on-calls at the time of writing, in case you are alerted to this queue growth during your shift.
Sort-of-but-not-really related to scalability#86 (closed) (@msmiley), which also concerns this queue. This issue is predicated on the assumption that there is no resolution to growth in queue length here beyond increasing capacity, since all of these jobs must run.