Decide on requests/limits and number of pods per node
Problem Statement
Consider bumping web to the same configuration as the API in Kubernetes:
- Worker count: 6
- HPA target CPU: 3600m
- RAM limit: 12G
- RAM requests: 7G
- CPU requests: 4500m
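As a sketch, the values above would map onto a container resources block roughly like the following. This is illustrative only: the key names (`workerProcesses`, `hpa.cpu.targetAverageValue`) are assumptions about the Helm values layout and may differ from the actual chart.

```yaml
# Illustrative only: proposed web values, mirroring the API configuration.
workerProcesses: 6              # hypothetical key for Worker Count
resources:
  requests:
    cpu: 4500m
    memory: 7G
  limits:
    memory: 12G                 # no CPU limit is listed, so none is set here
hpa:
  cpu:
    targetAverageValue: 3600m   # hypothetical key for the HPA target
```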
We are currently mid-migration: production#5359 (closed). Let's hold off on the above changes until the migration is completed. After the migration we should apply this one cluster at a time to validate that web behaves appropriately and that we do not see any negative impact.
Following this we should clean up the configurations in lower environments, such as the canary stage, staging, etc.
Status 2021-08-24
- See details in #1873 (comment 659164601)
- The HPA target has been lowered due to poor Apdex values during the migration
| Environment | Target CPU | MinReplicas | MaxReplicas | Worker Count |
|---|---|---|---|---|
| gprd | 2200m | 25 | 150 | 4 |
Status 2021-08-19
- Web is currently running in all environments except production, with the following settings:
| Environment | Workhorse Requests | Workhorse Limits | webservice requests | webservice limits |
|---|---|---|---|---|
| pre | CPU: 100m, MEM: 50M | MEM: 1G | CPU: 1, MEM: 1250M | MEM: 4G |
| gstg | CPU: 600m, MEM: 200M | MEM: 2G | CPU: 4, MEM: 5G | MEM: 6G |
| gprd/cny | CPU: 600m, MEM: 200M | MEM: 2G | CPU: 4, MEM: 5G | MEM: 8G |
| Environment | Target CPU | MinReplicas | MaxReplicas |
|---|---|---|---|
| pre | 1600m | 2 | 5 |
| gstg | 1600m | 2 | 30 |
| cny | 3600m | 16 | 50 |
| gprd | 3600m | 25 | 150 |
Once the web migration is complete, we will review these metrics based on the utilisation patterns we see, to try to reduce the number of pods (if possible) and to make sure there is no resource saturation.
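For reference, the gprd row in the HPA table above could be expressed as a manifest along these lines. This is a hedged sketch: the object name and scale target are hypothetical, and in practice the Helm chart generates the real HPA.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice        # hypothetical target
  minReplicas: 25
  maxReplicas: 150
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue       # a millicore target like 3600m is an absolute
          averageValue: 3600m      # per-pod value, not a percentage of requests
```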
Original Issue Description
Tentative idea: 1 pod per node (not pursued at this time)
Following a conversation @jarv and I had on different strategies to avoid increasing the number of nodes we will be running for web as part of the Kubernetes migration, we decided to consider an approach where, instead of trying to fit multiple pods on a single node, each with a lower worker count, we fit only one pod per node but keep the same worker count we had on VMs (16).
This approach has some pros and cons.
Pros:
- Pod count and node count should be consistent (equal) outside of scaling events
- We won't lose capacity from being unable to squeeze another pod onto a node
Cons:
- We have to wait for an entire new node to come up before we can add a new pod (HPA overall will be slower)
- Failure of a single pod will cause a larger dip in capacity
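The second con can be quantified with a quick sketch. The numbers below (a 10-node fleet, 16 workers per big pod versus 4 pods of 4 workers per node) are illustrative assumptions, not measured values.

```python
# Illustrative comparison of the capacity dip when one pod fails,
# under the one-big-pod-per-node vs. many-small-pods layouts.

def capacity_after_pod_failure(pods: int, workers_per_pod: int) -> float:
    """Fraction of total worker capacity remaining after losing one pod."""
    total = pods * workers_per_pod
    return (total - workers_per_pod) / total

# One 16-worker pod per node, 10 nodes (assumed fleet size):
big = capacity_after_pod_failure(pods=10, workers_per_pod=16)
# Four 4-worker pods per node, same 10 nodes:
small = capacity_after_pod_failure(pods=40, workers_per_pod=4)

print(big, small)  # 0.9 vs 0.975: one big pod costs 10% capacity, one small pod 2.5%
```

The total worker count is identical in both layouts; only the blast radius of a single pod failure differs, which is the trade-off noted above.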