Deep dive into load spike for urgent-other
Context: Even after dramatically increasing the node size for urgent-other and increasing the number of pods to more than double the number of workers we have on VMs, we saw a spike in `authorized_projects` force a scaling event where 20 new pods were created, which pegged the CPU and caused delays for other urgent job processing.
production#2254 (comment 359661198)
This issue is to collect information and come up with a plan for next steps. Here is what we know:
- The urgent-other workload is isolated to its own node pool
- We have the following node specs in the node pool:
  - 3 x n1-standard-32 = 96 CPUs and 360 GB of memory, $699/month
- Sidekiq running on VMs has the following specs:
  - 10 x n1-standard-4 = 40 CPUs and 150 GB of memory, $290/month
- On VMs we have 80 Sidekiq processes x 5 threads to handle requests
- On K8s we now have 180 replicas, and will max out at 300
- On K8s, the Sidekiq HPA has a `targetAverageValue` of 450m (0.45 cores); this is set for all shards. What this means is that we will start to scale up pods if the average CPU usage across all pods goes above 0.45 cores (a worked example of the HPA math is sketched after this list)
- On 2020-06-11 14:48 we started processing all jobs on the K8s cluster
- On 2020-06-11 from 15:00 to 15:10 we started to see CPU saturation, so we started the VMs again to split the load
- We can see here https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=avg%20by%20(pod)%20(rate(container_cpu_usage_seconds_total%7Benvironment%3D%22gprd%22%2C%20pod_name%3D~%22gitlab-sidekiq-urgent-other.*%22%7D%5B1m%5D))&g0.tab=0 that the average CPU across all pods started to exceed 0.45 cores (query decoded below)
- What exactly happened here that resulted in a CPU utilization spike? ^
- This caused 20 pods to be created https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=avg(kube_replicaset_spec_replicas%7Breplicaset%3D~%22%5Egitlab-sidekiq-urgent.*%22%2C%20cluster%3D%22gprd-gitlab-gke%22%7D)%20by%20(replicaset)&g0.tab=0 (query decoded below), which correlates with degraded performance on all queues.
- It looks like a spike in `authorized_projects` was the cause of this, which is normal spikiness for this queue
  - queue processing (query decoded below): https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=sum(queue%3Asidekiq_jobs_completion%3Arate1m%7Benvironment%3D%22gprd%22%2C%20shard%3D~%22urgent-other%22%7D)%20by%20(queue)&g0.tab=0
  - queued jobs, note that `authorized_projects` leads the backlog in other queues (query decoded below): https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.end_input=2020-06-11%2016%3A00&g0.max_source_resolution=0s&g0.expr=sum%20by%20(queue)%20(%0A%20%20(%0A%20%20%20%20label_replace(%0A%20%20%20%20%20%20sidekiq_queue_size%7Benvironment%3D%22gprd%22%7D%20and%20on(fqdn)%20(redis_connected_slaves%20!%3D%200)%2C%0A%20%20%20%20%20%20%22queue%22%2C%20%22%240%22%2C%20%22name%22%2C%20%22.*%22%0A%20%20%20%20)%0A%20%20)%0A%20%20and%20on%20(queue)%0A%20%20(%0A%20%20%20%20max%20by%20(queue)%20(%0A%20%20%20%20%20%20rate(sidekiq_jobs_queue_duration_seconds_sum%7Benvironment%3D%22gprd%22%2C%20shard%3D~%22urgent-other%22%7D%5B1m%5D)%20%3E%200%0A%20%20%20%20)%0A%20%20)%0A)&g0.tab=0
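For reference on the scaling event above, a rough sketch of the HPA math (the 0.5-core figure is an assumption for illustration, not a measured value): the CPU-based HPA computes `desiredReplicas = ceil(currentReplicas * currentAverageCPU / targetAverageValue)`. With ~180 replicas and the per-pod average climbing to roughly 0.5 cores, `ceil(180 * 0.5 / 0.45) = 200`, which lines up with the ~20 extra pods the HPA created.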
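For readability, here is the PromQL behind the Thanos links above, decoded from the URLs (same queries, just URL-unescaped). First, average CPU per urgent-other pod (the CPU saturation graph):

```
avg by (pod) (rate(container_cpu_usage_seconds_total{environment="gprd", pod_name=~"gitlab-sidekiq-urgent-other.*"}[1m]))
```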
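Replica counts for the urgent shards (the pod scale-up graph):

```
avg(kube_replicaset_spec_replicas{replicaset=~"^gitlab-sidekiq-urgent.*", cluster="gprd-gitlab-gke"}) by (replicaset)
```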
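Job completion rate by queue on the urgent-other shard (the queue processing graph):

```
sum(queue:sidekiq_jobs_completion:rate1m{environment="gprd", shard=~"urgent-other"}) by (queue)
```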
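Queue backlog by queue, limited to queues currently showing queue duration on urgent-other (the queued jobs graph):

```
sum by (queue) (
  (
    label_replace(
      sidekiq_queue_size{environment="gprd"} and on(fqdn) (redis_connected_slaves != 0),
      "queue", "$0", "name", ".*"
    )
  )
  and on (queue)
  (
    max by (queue) (
      rate(sidekiq_jobs_queue_duration_seconds_sum{environment="gprd", shard=~"urgent-other"}[1m]) > 0
    )
  )
)
```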