Proposal: Unified installer for GitLab Reference Architectures across VM and k8s
Problem statement
GET (GitLab Environment Toolkit) is currently scoped to deploying and upgrading VM-based reference architectures. This has significant benefits on its own, but other deployment types could also benefit from GET:
- Hybrid deployments: stateful services on VMs (Redis, PG, Gitaly), stateless services on k8s (API, web, Sidekiq, etc.)
- Private cloud k8s deployments: Redis/PG on VMs, GitLab services on k8s
  - We do not recommend using our built-in Redis or PG solutions, and direct users to seek alternatives. Most cloud deployments can use managed services like RDS, but on-premises users generally cannot and need a different solution. GET could fill this gap.
- Cloud k8s deployments: Redis/PG on a cloud managed service, GitLab services on k8s
  - We do not recommend using our built-in Redis or PG solutions, but we can make it easy to leverage RDS/ElastiCache/etc.
  - There are also some concerns about running Gitaly at scale in k8s, so we could use GET to stand Gitaly up on VMs as well.
Potential solutions
Provide a single unified tool
Under this option, GET can deploy services either on VMs (Terraform/Ansible) or in Kubernetes (Helm/Ansible). Users select which services are deployed where, and GET deploys each one to the respective target.
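To make the "users select which services go where" idea concrete, here is a minimal sketch of what such a selection could look like. It is purely illustrative: the `PLACEMENT` mapping, the service names, and the `group_by_target` helper are hypothetical and not part of GET today. Only the tooling labels (Terraform/Ansible for VMs, Helm/Ansible for k8s) come from this proposal.

```python
from collections import defaultdict

# Hypothetical placement config: service name -> deployment target.
# The split mirrors the hybrid example from the problem statement.
PLACEMENT = {
    "postgres": "vm",
    "redis": "vm",
    "gitaly": "vm",
    "webservice": "k8s",
    "sidekiq": "k8s",
}

# Tooling this proposal names for each target.
TOOLING = {"vm": "Terraform/Ansible", "k8s": "Helm/Ansible"}


def group_by_target(placement):
    """Group services by target, rejecting targets that have no tooling."""
    groups = defaultdict(list)
    for service, target in placement.items():
        if target not in TOOLING:
            raise ValueError(f"unknown target {target!r} for {service!r}")
        groups[target].append(service)
    # Sort service names so the result is stable regardless of input order.
    return {target: sorted(services) for target, services in groups.items()}


if __name__ == "__main__":
    for target, services in group_by_target(PLACEMENT).items():
        print(f"{target} ({TOOLING[target]}): {', '.join(services)}")
```

A single mapping like this would give GET one place to validate the user's choices before handing each group to the matching toolchain.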
This model has a few benefits:
Ease of use
Users who want a hybrid or full-k8s approach can get started more quickly, because GET can provision the stateful services (PG, Redis, Gitaly, object storage) that we generally don't recommend running in k8s. This saves users from provisioning those services manually and then copying the resulting settings into the Helm values for the stateless services.
Additionally, GET can coordinate an upgrade across all services in a single command, rather than relying on the multiple commands and manual coordination that an upgrade otherwise requires.
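As a rough sketch of what "a single command" could mean in practice, the snippet below runs a list of upgrade steps in order and halts at the first failure. Everything here is an assumption for illustration: `run_upgrade` is a hypothetical helper, the step names are placeholders, and the ordering (VM-hosted stateful services before k8s-hosted stateless services) is not something this proposal specifies.

```python
from typing import Callable, List, Tuple

# A step is a label plus a zero-argument callable that performs the work.
Step = Tuple[str, Callable[[], None]]


def run_upgrade(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and report progress."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps are not attempted once one step fails.
            raise RuntimeError(
                f"upgrade halted at {name!r}; completed: {completed}"
            ) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real tool would invoke Ansible or Helm here.
    steps = [
        ("stateful services on VMs", lambda: print("upgrading VM services")),
        ("stateless services on k8s", lambda: print("upgrading k8s services")),
    ]
    print(run_upgrade(steps))
```

The point of the sketch is only that one entry point owns the sequencing and the failure handling, which is the coordination users otherwise do by hand.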
Opportunity to steer users to the best solution
By having a common starting point for both VMs and k8s, we can help ensure the "default" setting is the best one for users. Perhaps in the future the default could be a small k8s cluster for GitLab to run in, rather than Omnibus.
Risks
- The primary concern with expanding the scope to include k8s-based GitLab services is complexity. Supporting GitLab on VMs across multiple cloud vendors and on-premises is a tall order as-is. Adding services hosted in k8s adds significant further complexity, and managing both installs and upgrades across this whole matrix of options could be a significant effort. That said, Terraform does have a Helm provider, so perhaps it is not as complex as I fear.
- Long-term obsolescence through further advancements in k8s tooling (future state, minor concern)
  - Services such as Crossplane and other service brokers can make it easy for our operator to stand up required dependencies like PG and Redis when deployed in a cloud environment.
  - Advancements in operators for other open-source tools, such as Postgres. If a "winner" emerges among the PG operators that runs well in k8s, we could also use it as a dependency for a "production" cluster.
GET supports stateful VM services only [already supported]
Note: This is already supported today - https://gitlab.com/gitlab-org/quality/performance/-/issues/145
GET stands up only the stateful services and outputs any configuration that another service needs in order to connect to them.
The admin can then take that output and plug it into the Helm chart for consumption.
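The hand-off described here could look something like the sketch below, which turns the endpoints of VM-hosted services into a values structure for the Helm chart. Treat it as an assumption-laden illustration: `helm_values` is a hypothetical helper, the IP addresses are made up, and the key names (`postgresql.install`, `redis.install`, `global.psql.host`, `global.redis.host`, `global.gitaly.external`) reflect my understanding of the GitLab chart's external-services settings and should be checked against the chart documentation. Secrets such as passwords and tokens are deliberately left out.

```python
import json


def helm_values(postgres_host, redis_host, gitaly_storages):
    """Build Helm values pointing the chart at externally hosted services.

    gitaly_storages maps a storage name to the hostname serving it.
    Key names are assumed from the GitLab chart docs; verify before use.
    """
    return {
        # Disable the chart's bundled PG and Redis.
        "postgresql": {"install": False},
        "redis": {"install": False},
        "global": {
            "psql": {"host": postgres_host},
            "redis": {"host": redis_host},
            "gitaly": {
                # Use the VM-hosted Gitaly nodes instead of in-cluster Gitaly.
                "enabled": False,
                "external": [
                    {"name": name, "hostname": host}
                    for name, host in gitaly_storages.items()
                ],
            },
        },
    }


if __name__ == "__main__":
    # Example addresses are placeholders, not real GET output.
    values = helm_values("10.0.0.5", "10.0.0.6", {"default": "10.0.0.7"})
    print(json.dumps(values, indent=2))
```

Emitting a structure like this from GET would replace the manual copy/paste step, though the admin would still need to supply credentials separately.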
Risks
- Upgrades: with an upgrade spread across two tools, coordinating the upgrade process could be significantly more difficult
- Extra effort on behalf of our users to get started
Next steps
- Confirm the blockers holding us back from deploying Gitaly in Kubernetes. If we can do this, the only remaining services are non-GitLab components like Redis/PG/ES. This would reduce the applicable use cases for this proposal, as it is relatively easy to stand up PG/Redis on most cloud providers.
  - @WarheadsSE confirmed we are still some distance away, due to items like the possibility of Gitaly being killed: https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/issues/79#note_514604094
- Confirm the incremental complexity and maintenance costs we would be taking on with this approach.