Kubernetes cluster optimization
One of the advantages of Kubernetes is its ability to leverage the scheduler for intelligent bin packing. This allows companies to achieve higher utilization of their compute resources than they could by managing jobs and VM sizes manually, because the scheduler can dynamically evaluate the available nodes and the workloads that need to run, and optimize the distribution.
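The packing idea can be illustrated with a toy heuristic. This is a hypothetical sketch, not how kube-scheduler actually scores nodes (its scoring is far more sophisticated), but it shows why packing by request size raises utilization:

```python
def first_fit_decreasing(pod_requests, node_capacity):
    """Assign each pod's CPU request (millicores) to the first node
    with room, largest pods first, and report how many nodes we used."""
    nodes = []  # free capacity remaining on each provisioned node
    for request in sorted(pod_requests, reverse=True):
        for i, free in enumerate(nodes):
            if request <= free:
                nodes[i] = free - request
                break
        else:
            nodes.append(node_capacity - request)  # provision a new node
    return len(nodes)

# Six pods totalling 7000m packed onto two 4000m nodes instead of three:
print(first_fit_decreasing([2500, 2000, 1500, 500, 300, 200], 4000))  # 2
```

Naively placing those pods in arrival order can strand capacity across three nodes; packing them against their requests fits them in two.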
In practice, however, it is quite easy for a few bad actors to significantly erode that efficiency. For example, someone may slightly over-estimate the CPU request for an app, and at scale that can waste a significant amount of resources.
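To see how a small per-pod over-request compounds, here is some back-of-the-envelope arithmetic with assumed numbers (the fleet size, over-request, and node shape are illustrative):

```python
overestimate_millicores = 100   # assumed per-replica CPU over-request
replicas = 500                  # assumed fleet size
node_cpu_millicores = 4000      # assumed 4-core nodes

# Requests reserve capacity on the node even if it is never used,
# so the scheduler cannot place other pods into this headroom.
wasted_cores = overestimate_millicores * replicas / 1000
wasted_nodes = wasted_cores * 1000 / node_cpu_millicores
print(f"{wasted_cores:.0f} cores reserved but unused (~{wasted_nodes:.1f} nodes)")
```

A 100m slip per replica across 500 replicas strands 50 cores, roughly a dozen 4-core nodes that are paid for but do no work.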
GitLab is well positioned to provide a comprehensive solution here:
- We know the performance characteristics of each pod and node through our Prometheus integration.
- We know which pods correspond to which projects. This is important for shared clusters, which offer the greatest opportunities for efficiency gains due to scale.
- We can even track changes in usage to individual commits/deploys.
- As a further enhancement, a company could provide node costs, and we can then translate usage into dollars. We could then provide estimated costs, as well as help companies bill internally for infrastructure.
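The dollar-translation step in the last bullet could be as simple as attributing each pod a share of its node's price. A minimal sketch, assuming a customer-supplied hourly node price and a max-of-CPU-or-memory chargeback heuristic (all names and numbers here are illustrative, not a GitLab API):

```python
def pod_hourly_cost(node_price, pod_cpu, node_cpu, pod_mem, node_mem):
    """Charge the pod the larger of its CPU or memory fraction of the node,
    since the dominant resource is what actually limits packing."""
    share = max(pod_cpu / node_cpu, pod_mem / node_mem)
    return node_price * share

# A pod using 1 of 4 cores and 2 of 16 GiB on a $0.20/hour node:
print(round(pod_hourly_cost(0.20, 1.0, 4.0, 2.0, 16.0), 3))  # 0.05
```

Because we already map pods to projects, summing this per project over a billing period yields a per-project infrastructure bill.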
Companies seem to be doing this mostly manually right now, with a person or two dedicated to optimizing compute spend on cloud providers. The headcount is well worth it, however, as compute costs are a major expense.
There are a few iterations we could go through here:
- Analytics and reporting on cluster efficiency, creating issues and blaming the worst offenders.
- Provide recommendations on changes to make to increase efficiency. (Increase RAM of nodes, decrease CPU request of pod X, etc.)
- Automatically implement the changes to right-size pod requests.
Done well, this feature could quickly pay for the GitLab Ultimate license by itself for companies at a certain scale.