Deep dive into load spike for urgent-other
Context: Even after dramatically increasing the node size for urgent-other and increasing the number of pods to more than double the number of workers we have on VMs, we saw a spike in `authorized_projects` force a scaling event where 20 new pods were created, which pegged the CPU and caused delays for other urgent job processing.
production#2254 (comment 359661198)
This issue is to collect information and come up with a plan for next steps. Here is what we know:
- The urgent-other workload is isolated to its own node pool
- We have the following node specs in the node pool:
  - 3 x n1-standard-32 = 96 CPUs and 360 GB of memory, $699/month
- Sidekiq running on VMs has the following specs:
  - 10 x n1-standard-4 = 40 CPUs and 150 GB of memory, $290/month
- On VMs we have 80 Sidekiq processes x 5 threads to handle requests
- On K8s we now have 180 replicas, and will max out at 300
- On K8s, the Sidekiq HPA has a `targetAverageValue` of 450m (0.45 cores); this is set for all shards. What this means is that we will start to scale up pods if the average CPU usage across all pods goes above 0.45 cores (a worked example of the HPA math is sketched after this list)
- On 2020-06-11 14:48 we started processing all jobs on the K8s cluster
- On 2020-06-11 from 15:00 to 15:10 we started to see CPU saturation, so we started the VMs again to split the load
- We can see here https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=avg%20by%20(pod)%20(rate(container_cpu_usage_seconds_total%7Benvironment%3D%22gprd%22%2C%20pod_name%3D~%22gitlab-sidekiq-urgent-other.*%22%7D%5B1m%5D))&g0.tab=0 that the average CPU across all pods started to exceed 0.45 cores (query decoded below)
- What exactly happened here that resulted in a CPU utilization spike? ^
- This caused 20 pods to be created https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=avg(kube_replicaset_spec_replicas%7Breplicaset%3D~%22%5Egitlab-sidekiq-urgent.*%22%2C%20cluster%3D%22gprd-gitlab-gke%22%7D)%20by%20(replicaset)&g0.tab=0 (query decoded below), which correlates with degraded performance on all queues.
- It looks like a spike in `authorized_projects` was the cause of this, which is normal spikiness for this queue
  - queue processing (query decoded below): https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=sum(queue%3Asidekiq_jobs_completion%3Arate1m%7Benvironment%3D%22gprd%22%2C%20shard%3D~%22urgent-other%22%7D)%20by%20(queue)&g0.tab=0
  - queued jobs, note that `authorized_projects` leads the backlog in other queues (query decoded below): https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=sum%20by%20(queue)%20(%0A%20%20(%0A%20%20%20%20label_replace(%0A%20%20%20%20%20%20sidekiq_queue_size%7Benvironment%3D%22gprd%22%7D%20and%20on(fqdn)%20(redis_connected_slaves%20!%3D%200)%2C%0A%20%20%20%20%20%20%22queue%22%2C%20%22%240%22%2C%20%22name%22%2C%20%22.*%22%0A%20%20%20%20)%0A%20%20)%0A%20%20and%20on%20(queue)%0A%20%20(%0A%20%20%20%20max%20by%20(queue)%20(%0A%20%20%20%20%20%20rate(sidekiq_jobs_queue_duration_seconds_sum%7Benvironment%3D%22gprd%22%2C%20shard%3D~%22urgent-other%22%7D%5B1m%5D)%20%3E%200%0A%20%20%20%20)%0A%20%20)%0A)&g0.tab=0
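For reference on the scaling event above, a rough sketch of the HPA math (the 0.5-core figure is an assumption for illustration, not a measured value): the CPU-based HPA computes `desiredReplicas = ceil(currentReplicas * currentAverageCPU / targetAverageValue)`. With ~180 replicas and the per-pod average climbing to roughly 0.5 cores, `ceil(180 * 0.5 / 0.45) = 200`, which lines up with the ~20 extra pods the HPA created.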
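For readability, here is the PromQL behind the Thanos links above, decoded from the URLs (same queries, just URL-unescaped). First, average CPU per urgent-other pod (the CPU saturation graph):

```
avg by (pod) (rate(container_cpu_usage_seconds_total{environment="gprd", pod_name=~"gitlab-sidekiq-urgent-other.*"}[1m]))
```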
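Replica counts for the urgent shards (the pod scale-up graph):

```
avg(kube_replicaset_spec_replicas{replicaset=~"^gitlab-sidekiq-urgent.*", cluster="gprd-gitlab-gke"}) by (replicaset)
```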
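Job completion rate by queue on the urgent-other shard (the queue processing graph):

```
sum(queue:sidekiq_jobs_completion:rate1m{environment="gprd", shard=~"urgent-other"}) by (queue)
```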
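Queue backlog by queue, limited to queues currently showing queue duration on urgent-other (the queued jobs graph):

```
sum by (queue) (
  (
    label_replace(
      sidekiq_queue_size{environment="gprd"} and on(fqdn) (redis_connected_slaves != 0),
      "queue", "$0", "name", ".*"
    )
  )
  and on (queue)
  (
    max by (queue) (
      rate(sidekiq_jobs_queue_duration_seconds_sum{environment="gprd", shard=~"urgent-other"}[1m]) > 0
    )
  )
)
```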