Decide on requests/limits and number of pods per node

Problem Statement

Consider bumping the web service to the same values as the API, using the following Kubernetes configuration:

  • Worker count: 6
  • HPA target CPU: 3600m
  • RAM limit: 12G
  • RAM requests: 7G
  • CPU requests: 4500m
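Expressed as Kubernetes container resources, the proposal above would look roughly like the sketch below. This is a hedged illustration only: the deployment/container names and the `WORKER_PROCESSES` variable are placeholders, and the real values are managed per environment rather than in a raw manifest like this.

```yaml
# Hypothetical sketch of the proposed web settings; names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice        # assumed name
spec:
  template:
    spec:
      containers:
        - name: webservice
          env:
            - name: WORKER_PROCESSES   # assumed knob for the worker count
              value: "6"
          resources:
            requests:
              cpu: 4500m
              memory: 7G
            limits:
              memory: 12G    # memory-only limit; no CPU limit is proposed
```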

We are currently mid-migration (production#5359 (closed)). Let's hold off on the above changes until the migration is completed. After the migration we should roll this out one cluster at a time to validate that web behaves appropriately and that we do not see any negative impact.

Following this, we should clean up the configurations in lower environments, such as canary, staging, etc.

Status 2021-08-24

| Environment | Target CPU | MinReplicas | MaxReplicas | Worker Count |
|-------------|------------|-------------|-------------|--------------|
| gprd        | 2200m      | 25          | 150         | 4            |

Status 2021-08-19

  • Web is currently running in all environments except production, with the settings specified as follows:
| Environment | Workhorse Requests   | Workhorse Limits | webservice Requests | webservice Limits |
|-------------|----------------------|------------------|---------------------|-------------------|
| pre         | CPU: 100m, MEM: 50M  | MEM: 1G          | CPU: 1, MEM: 1250M  | MEM: 4G           |
| gstg        | CPU: 600m, MEM: 200M | MEM: 2G          | CPU: 4, MEM: 5G     | MEM: 6G           |
| gprd/cny    | CPU: 600m, MEM: 200M | MEM: 2G          | CPU: 4, MEM: 5G     | MEM: 8G           |
| Environment | Target CPU | MinReplicas | MaxReplicas |
|-------------|------------|-------------|-------------|
| pre         | 1600m      | 2           | 5           |
| gstg        | 1600m      | 2           | 30          |
| cny         | 3600m      | 16          | 50          |
| gprd        | 3600m      | 25          | 150         |
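The HPA rows above map onto a HorizontalPodAutoscaler per environment. Below is a hedged sketch for gprd, assuming the "3600m" target is an average CPU value per pod (the object names are placeholders, not taken from the actual charts):

```yaml
# Hypothetical HPA for gprd; names are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice
  minReplicas: 25
  maxReplicas: 150
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue
          averageValue: 3600m   # "Target CPU" from the table above
```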

Once the web migration is complete, we will review these metrics based on the utilisation patterns we see, to try to reduce the number of pods (if possible) and to ensure there is no resource saturation.

Original Issue Description


Tentative idea: 1 pod per node (not pursued at this time)

Following a conversation between @jarv and me on different strategies to avoid increasing the number of nodes we will be running for web as part of the Kubernetes migration, we decided to consider an alternative approach: instead of trying to fit multiple pods on a single node, each with a lower worker count, we fit only one pod per node, but keep the same worker count we had on VMs (16).

This approach has some pros and cons.

Pros:

  • Pod count and node count should be consistent (equal) outside of scaling events
  • We won't lose capacity due to being unable to squeeze another pod onto a node

Cons:

  • We have to wait for an entire new node to come up before we can add a new pod (HPA scaling will be slower overall)
  • Failure of a single pod will cause a larger dip in capacity
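The one-pod-per-node idea above could be enforced with a required pod anti-affinity on the hostname topology. This is a sketch only (the idea was not pursued), and the pod label is an assumption:

```yaml
# Hypothetical pod-spec fragment: allow at most one web pod per node.
# The app label is an assumption, not the real chart value.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: gitlab-webservice   # assumed label
        topologyKey: kubernetes.io/hostname
```

Pod requests sized close to the node's allocatable resources would achieve a similar effect implicitly, but anti-affinity makes the one-pod-per-node constraint explicit to the scheduler.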
Edited by John Skarbek