Consider More Transparent and Implementation-Agnostic Requirements Statements and Calculations in Reference Architectures.
This was originally seeded by discussions on this issue: #20 (closed)
The reference architectures keep expressing requirements in terms of Kubernetes node counts - and specifically in terms of Google Cloud instance types.
For example, in this issue "node count x GCP node type" seems to be shorthand for a specific vCPU + memory requirement.
I think this creates some challenges:
- I think the choice of nodes - especially K8s nodes - should be derived from the total vCPU + memory of the pod specs plus overhead. This would be more efficient because many possible instance combinations can meet that total, both within and across any given provider's virtual node hardware. For instance:
- Each cloud has instances that emphasize CPU or memory in different ratios.
- In some clouds, certain instance types are not available in all regions, or no instances of a given class are available in a region during provisioning or scaling. (E.g., in AWS we can use something called "EC2 Fleet" to ask for "any instance type from this list" to prevent instance-type exhaustion - this actually results in multiple instance vCPU/memory shapes in the same cluster, but is better than failing to provision.)
- GCP has vertical autoscaling - while I don't know if this applies to GKE nodes - it's an example of an innovation that makes stating requirements in vCPU and Memory even more relevant.
- processing efficiencies in hardware (such as ARM) can be more easily translated.
- Customers who choose savings plans that lock in instance types will want to prefer those types regardless of other concerns.
- Translating requirements and measurements by looking up instance types on one cloud provider is challenging.
- It makes our Reference Architecture efforts appear somewhat Google-centric - and while that is where all of this is tested, I think instance selection should generally be treated as an implementation detail.
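To illustrate the first bullet, here is a minimal sketch of deriving a cluster requirement from pod specs plus overhead, then showing that several instance shapes can satisfy the same total. All pod sizes, overhead figures, and instance shapes below are made up for illustration - they are not actual GitLab reference-architecture numbers.

```python
import math

# Hypothetical pod sizing data -- NOT actual reference-architecture figures.
pods = {
    "webservice": {"replicas": 5, "vcpu": 4.0, "memory_gb": 5.0},
    "sidekiq":    {"replicas": 4, "vcpu": 1.0, "memory_gb": 2.0},
    "gitaly":     {"replicas": 3, "vcpu": 4.0, "memory_gb": 12.0},
}

# Assumed cluster-wide system overhead.
overhead = {"vcpu": 4.0, "memory_gb": 8.0}

total_vcpu = sum(p["replicas"] * p["vcpu"] for p in pods.values()) + overhead["vcpu"]
total_mem = sum(p["replicas"] * p["memory_gb"] for p in pods.values()) + overhead["memory_gb"]
print(f"Cluster requirement: {total_vcpu} vCPU, {total_mem} GB memory")

# Any instance mix that covers both totals works; two illustrative shapes:
candidates = {
    "8 vCPU / 32 GB (memory-weighted)":   {"vcpu": 8,  "memory_gb": 32},
    "16 vCPU / 32 GB (compute-weighted)": {"vcpu": 16, "memory_gb": 32},
}
counts = {}
for name, shape in candidates.items():
    # Node count is driven by whichever dimension runs out first.
    counts[name] = max(math.ceil(total_vcpu / shape["vcpu"]),
                       math.ceil(total_mem / shape["memory_gb"]))
    print(f"{counts[name]} x {name}")
```

Stating the requirement as the two totals (here 40 vCPU, 77 GB) lets each implementer pick whichever shape their cloud, region, or savings plan favors.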
Perhaps we could consider the following to make the requirements and calculations in reference architectures more transparent and less implementation specific:
- State all requirements in vCPU and memory.
- Summate all requirements for a cluster in vCPU and memory.
- Have Reference Architecture GPT tests state the actual tested node configuration - including cloud-specific details such as Google instance types.
- If reference architectures give node guidance at all, state it in neutral terms, such as: "Due to the vCPU and memory blocking of pod requirements, some cloud instances will more evenly fit a fixed number of a specific pod for fixed-scaling activities like performance testing" and/or "Choosing overly small node sizes may increase per-node waste due to the sizes of GitLab pods."
I am personally hesitant even about the last bullet above, because when you aggregate a huge variety of vCPU and memory requirements there are many places where efficiencies are realized. And if that's not true in a given case, adding one or two small K8s nodes can easily make up the shortfall at a reasonable cost - so there isn't likely a need to opine on exact hosts too much.
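The per-node waste point can be sketched numerically. Assuming a hypothetical 4 vCPU / 5 GB pod and a fixed per-node system reservation (all figures invented for illustration), a small node strands more vCPU per scheduled pod than a larger one:

```python
import math

# Hypothetical pod shape -- not an actual GitLab pod spec.
pod = {"vcpu": 4.0, "memory_gb": 5.0}

# Two illustrative node shapes, each reserving 1 vCPU for system overhead.
nodes = {
    "small (8 vCPU / 32 GB)":  {"vcpu": 8.0,  "memory_gb": 32.0, "reserved_vcpu": 1.0},
    "large (16 vCPU / 64 GB)": {"vcpu": 16.0, "memory_gb": 64.0, "reserved_vcpu": 1.0},
}

results = {}
for name, n in nodes.items():
    allocatable_vcpu = n["vcpu"] - n["reserved_vcpu"]
    # Pods that fit is limited by the tighter of the two dimensions.
    fits = min(math.floor(allocatable_vcpu / pod["vcpu"]),
               math.floor(n["memory_gb"] / pod["memory_gb"]))
    stranded = allocatable_vcpu - fits * pod["vcpu"]
    results[name] = (fits, stranded)
    print(f"{name}: {fits} pods, {stranded} vCPU stranded "
          f"({stranded / fits} per pod)")
```

In this toy case both nodes strand 3 vCPU, but the small node amortizes that over 1 pod versus 3, tripling the per-pod waste - which is the shape of the "overly small node sizes" caution without naming any specific instance type.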
FYI I am dog fooding the above in tables like this: