
Define a guideline for Review Apps resource requests and limits to avoid overcommitting nodes and improve cluster stability

After watching https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits, I realized that while resource requests are used by the Kubernetes scheduler to decide which node to schedule a pod on, resource limits are only used by Kubernetes to:

  1. Throttle a container when it reaches its CPU limit
  2. Terminate (OOM-kill) a container when it exceeds its memory limit
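To make the distinction concrete, here is a simplified Python sketch of the scheduling side: a set of containers fits a node when the sum of their CPU requests fits the node's allocatable CPU, and limits play no part in that decision. It ignores most real scheduler details (memory, init containers, other predicates); the `Container` type, the helper names and the 2000m node are hypothetical, while the per-component values are the current ones from the CPU table. Because limits are not checked at scheduling time, the limits on a node can add up to more than its capacity, and the node becomes overcommitted as soon as pods really use more than they request.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical, simplified view of one container's CPU settings."""
    name: str
    cpu_request_m: int  # millicores the scheduler reserves on the node
    cpu_limit_m: int    # millicores above which the container gets throttled


def fits_on_node(containers, allocatable_cpu_m):
    """Scheduling check: only the sum of requests is compared to capacity."""
    return sum(c.cpu_request_m for c in containers) <= allocatable_cpu_m


def limit_overcommit_ratio(containers, allocatable_cpu_m):
    """Sum of limits divided by capacity; above 1.0 the node is overcommitted."""
    return sum(c.cpu_limit_m for c in containers) / allocatable_cpu_m


if __name__ == "__main__":
    # Current gitaly, sidekiq and unicorn values from the CPU table.
    pods = [
        Container("gitaly", 600, 1200),
        Container("sidekiq", 500, 1000),
        Container("unicorn", 400, 800),
    ]
    print(fits_on_node(pods, 2000))            # True: 1500m of requests
    print(limit_overcommit_ratio(pods, 2000))  # 1.5: limits add up to 3000m
```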

Given this information and:

  1. The fact that our nodes' p99 CPU utilization is above 100% (see screenshot Screen_Shot_2019-10-29_at_16.51.48)
  2. The fact that we can adjust each pod's resource requests based on its actual p99 utilization (see screenshot Screen_Shot_2019-10-29_at_16.53.13)
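Utilization in this proposal is the p99 usage divided by the configured request (or limit). A minimal Python sketch of that ratio, assuming a plain list of usage samples and the nearest-rank definition of the 99th percentile (the monitoring system behind the screenshots may compute percentiles differently); the function names and sample values are hypothetical:

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of usage samples."""
    ordered = sorted(samples)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]


def utilization(samples, configured):
    """p99 usage as a fraction of a configured request or limit."""
    return p99(samples) / configured


if __name__ == "__main__":
    usage_m = list(range(1, 101))      # hypothetical CPU samples in millicores
    print(p99(usage_m))                # 99
    print(utilization(usage_m, 50))    # 1.98, i.e. 198% of a 50m request
```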

I think we should adjust:

  1. Resource requests so that the p99 CPU utilization of every component is between 80% and 100% of its request.
  2. Resource limits so that the p99 CPU utilization of every component is always below 70% of its limit.
  3. Resource requests so that the p99 memory utilization of every component is between 80% and 100% of its request.
  4. Resource limits so that the p99 memory utilization of every component is below 70% of its limit.
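These rules can be expressed as a check on a component's p99 usage, plus a way to derive a request and limit from it. The Python sketch below assumes the convention used in both tables (proposed limit = proposed request * 1.5) and a hypothetical 90% target inside the 80%-100% band; the function names are illustrative. With a 1.5 factor, a request utilization of at most 100% gives a limit utilization of at most about 67%, so the 70% rule follows automatically.

```python
LIMIT_FACTOR = 1.5  # proposed limit = proposed request * 1.5, as in the tables


def meets_guideline(p99_usage, request, limit):
    """True if p99 usage is 80%-100% of the request and below 70% of the limit."""
    return 0.8 <= p99_usage / request <= 1.0 and p99_usage / limit < 0.7


def propose(p99_usage, target=0.9):
    """Pick a request so that p99 usage is `target` of it (hypothetical 90%)."""
    request = p99_usage / target
    return request, request * LIMIT_FACTOR


if __name__ == "__main__":
    # 1128m is gitaly's current p99 estimate (188% of its 600m request).
    print(meets_guideline(1128, 600, 1200))   # False: 188% of the request
    print(meets_guideline(1128, 1200, 1800))  # True: 94% and about 63%
    request, limit = propose(900)
    print(round(request), round(limit))       # 1000 1500
```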

CPU proposal

| Component | Current p99 CPU usage (% of request) | Current p99 CPU usage (% of limit) | Current CPU request | Current CPU limit | Proposed CPU request | Proposed CPU limit |
|---|---|---|---|---|---|---|
| gitaly | 188% | 93% | 600m | 1200m | 1200m (600m * 2) | 1800m (1200m * 1.5) |
| gitlab-shell | 170% | 85% | 125m | 250m | 230m (125m * 1.84) | 345m (230m * 1.5) |
| sidekiq | 125% | 85% | 500m | 1000m | 650m (500m * 1.3) | 975m (650m * 1.5) |
| unicorn | 95% | 65% | 400m | 800m | 500m (400m * 1.25) | 750m (500m * 1.5) |
| gitlab-workhorse | 67% | 34% | 300m | 600m | 250m (300m * 0.83) | 375m (250m * 1.5) |
| gitlab-runner | 120% | 60% | 355m | 710m | 450m (355m * 1.26) | 675m (450m * 1.5) |
| nginx-ingress/controller | 27% | 15% | 100m | 200m | 100m | 200m |
| nginx-ingress/defaultBackend | 20% | 13% | 5m | 10m | 5m | 10m |
| postgresql | 95% | 60% | 250m | 500m | 300m (250m * 1.2) | 450m (300m * 1.5) |
| redis | 20% | 10% | 100m | 200m | 100m | 200m |
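The CPU table can be re-checked against the rules with a short Python sketch. It estimates each component's p99 usage as current request utilization * current request, which is an assumption: the limit-utilization column implies a different usage for some rows (for example sidekiq, 85% of 1000m versus 125% of 500m). Values are copied from the table; the helper names are hypothetical.

```python
# (component, current p99 % of request, current request, proposed request,
#  proposed limit), CPU values in millicores, copied from the table above.
CPU_TABLE = [
    ("gitaly", 188, 600, 1200, 1800),
    ("gitlab-shell", 170, 125, 230, 345),
    ("sidekiq", 125, 500, 650, 975),
    ("unicorn", 95, 400, 500, 750),
    ("gitlab-workhorse", 67, 300, 250, 375),
    ("gitlab-runner", 120, 355, 450, 675),
    ("nginx-ingress/controller", 27, 100, 100, 200),
    ("nginx-ingress/defaultBackend", 20, 5, 5, 10),
    ("postgresql", 95, 250, 300, 450),
    ("redis", 20, 100, 100, 200),
]


def projected_utilization(pct_of_request, current_request, new_request, new_limit):
    """Estimated p99 usage divided by the proposed request and limit."""
    usage_m = pct_of_request * current_request / 100  # estimated p99 usage
    return usage_m / new_request, usage_m / new_limit


def check(table):
    """Map component -> (request rule satisfied, limit rule satisfied)."""
    results = {}
    for name, pct, cur_req, new_req, new_lim in table:
        req_util, lim_util = projected_utilization(pct, cur_req, new_req, new_lim)
        results[name] = (0.8 <= req_util <= 1.0, lim_util < 0.7)
    return results


if __name__ == "__main__":
    for name, (request_ok, limit_ok) in check(CPU_TABLE).items():
        print(f"{name}: request rule {request_ok}, limit rule {limit_ok}")
```

Under this estimate every proposed limit stays below 70%, while unicorn (about 76%) and postgresql (about 79%) land just under the 80% request target, as do the components whose values are left unchanged.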

Memory proposal

| Component | Current p99 memory usage (% of request) | Current p99 memory usage (% of limit) | Current memory request | Current memory limit | Proposed memory request | Proposed memory limit |
|---|---|---|---|---|---|---|
| gitaly | 120% | TBD | 200M | 420M | 240M (200M * 1.2) | 360M (240M * 1.5) |
| gitlab-shell | 125% | TBD | 20M | 40M | 25M (20M * 1.25) | 37.5M (25M * 1.5) |
| sidekiq | 110% | TBD | 800M | 1.6G | 880M (800M * 1.1) | 1320M (880M * 1.5) |
| unicorn | 110% | TBD | 1.4G | 1.8G | 1540M (1400M * 1.1) | 2310M (1540M * 1.5) |
| gitlab-workhorse | 30% | TBD | 100M | 200M | 50M (100M * 0.5) | 75M (50M * 1.5) |
| gitlab-runner | 12% | TBD | 300M | 600M | 100M (300M * 0.3) | 150M (100M * 1.5) |
| nginx-ingress/controller | 180% | TBD | 250M | 500M | 450M (250M * 1.8) | 675M (450M * 1.5) |
| nginx-ingress/defaultBackend | 50% | TBD | 12M | 24M | 12M | 24M |
| postgresql | 85% | TBD | 256M | ? | 250M | 375M (250M * 1.5) |
| redis | 25% | TBD | 60M | 130M | 30M (60M * 0.5) | 45M (30M * 1.5) |
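The memory proposals mix M and G suffixes. A small Python sketch can parse them and confirm that each proposed limit is the proposed request * 1.5. It assumes decimal units (M = 10^6 bytes, G = 10^9 bytes), which is what these suffixes mean in Kubernetes quantities; since the check compares ratios of values with the same suffix, the result would not change if Mi/Gi were intended. Row values are copied from the table; the function names are hypothetical.

```python
UNITS = {"M": 10**6, "G": 10**9}  # decimal suffixes, as in Kubernetes quantities


def to_bytes(quantity):
    """Convert a table value such as '240M' or '1.6G' to an integer byte count."""
    number, suffix = quantity[:-1], quantity[-1]
    return round(float(number) * UNITS[suffix])


# (component, proposed request, proposed limit), copied from the table above.
MEM_PROPOSALS = [
    ("gitaly", "240M", "360M"),
    ("gitlab-shell", "25M", "37.5M"),
    ("sidekiq", "880M", "1320M"),
    ("unicorn", "1540M", "2310M"),
    ("gitlab-workhorse", "50M", "75M"),
    ("gitlab-runner", "100M", "150M"),
    ("nginx-ingress/controller", "450M", "675M"),
    ("nginx-ingress/defaultBackend", "12M", "24M"),
    ("postgresql", "250M", "375M"),
    ("redis", "30M", "45M"),
]


def limit_ratio_mismatches(rows, factor=1.5):
    """Components whose proposed limit is not proposed request * factor."""
    return [name for name, request, limit in rows
            if to_bytes(limit) != round(to_bytes(request) * factor)]


if __name__ == "__main__":
    print(to_bytes("1.6G"))                       # 1600000000
    print(limit_ratio_mismatches(MEM_PROPOSALS))  # ['nginx-ingress/defaultBackend']
```

Only nginx-ingress/defaultBackend is reported, because its unchanged values keep the original 2x ratio between limit and request.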
Edited by Rémy Coutable