Use Flux as GitOps solution for infrastructure workloads
# Context

## Current State of k8s Workloads Management

Infrastructure teams currently deploy Kubernetes workloads using two different mechanisms:

* [gitlab-helmfiles](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles): Uses [helmfile](https://github.com/helmfile/helmfile) for deploying helm chart releases to multiple environments and clusters.
* [tanka-deployments](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/tanka-deployments): Uses [tanka](https://tanka.dev/) for helm chart deployment.

There are several challenges we experience with our current setup:

- High complexity:
  - Two different tools and deployment mechanisms for managing k8s workloads.
  - In gitlab-helmfiles we don't have a good way to establish dependencies and/or precedence between releases. The current setup has [implicit dependencies](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master#implicit-inter-release-dependencies), which make it harder to deploy to test clusters where we don't want to deploy every release.
  - Multi-tenancy support and ownership are complex to provide. They require granting other teams permissions to the repo and modifying CODEOWNERS to restrict who can merge/approve MRs.
  - Tanka requires knowledge of the jsonnet language. Some libsonnet libraries are also hard to understand and not properly documented.
- Long deployment time:
  - Every change in an MR triggers jobs for every environment, and each job does a full diff of all releases installed on the environment, not only the one being changed.
- No constant reconciliation, which leads to drift if manual changes happen to the workloads.
  - Workload changes are not detected by the current tools in real time. Drift is only detected when the CI pipelines are executed.
- Hard to onboard new applications. Requires at least the following changes:
  - Add new environment and its configuration.
  - Add some baseline releases like cert-manager, external-dns, vault-k8s-secret, monitoring, etc.
  - Add a new release for the application's Helm chart and its corresponding values.
  - If something is not handled by the Helm release, we can configure raw Kubernetes manifests, but this requires adding an extra chart to support it.
  - See https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25957 for more details on complexity.
- Deployment failures trigger a helm rollback which [sometimes requires manual intervention](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master#helm-rollback).

## GitOps Evolution and Flux

GitOps tooling has evolved significantly, with tools like [ArgoCD](https://argoproj.github.io/cd/) and [Flux](https://fluxcd.io/) emerging as mature and powerful solutions for managing infrastructure workloads.

[GitLab has chosen Flux as the recommended GitOps tooling](https://about.gitlab.com/blog/2023/02/08/why-did-we-choose-to-integrate-fluxcd-with-gitlab/). Due to GitLab's significant involvement with the Flux project, Infrastructure teams are encouraged to use Flux as their GitOps tool of choice for dogfooding purposes. Runway has already paved the way and is using Flux for their ongoing k8s effort.

### Flux benefits over helmfile and Tanka:

1. Native Kubernetes Integration: Flux runs as a Kubernetes Operator and doesn't require CI pipelines to function, offering deeper integration than both helmfile and Tanka by leveraging the k8s Control Plane.
1. Automated Synchronization: Flux continuously monitors Git repositories and automatically applies changes, eliminating the need for manual triggers or additional CI/CD setups required by helmfile and Tanka.
1. Multi-tenancy and Multi-cluster Support: Flux has built-in capabilities to manage multiple environments, clusters, and teams more effectively than both alternatives.
1. Comprehensive Resource Management: Unlike helmfile (focused on Helm) or Tanka (using Jsonnet), Flux can manage Helm releases, raw Kubernetes manifests, and Kustomize resources in one tool.
1. Enhanced Security: Flux includes features like artifact signing and verification, providing an additional layer of security not present in helmfile or Tanka.
1. GitOps-native Approach: Flux is purpose-built for GitOps, offering a more streamlined workflow compared to the more general-purpose nature of helmfile and Tanka.
1. Image Automation: Flux can automatically update container images in your Git repository, a feature not available in either helmfile or Tanka.
1. Improved Observability: Flux provides better insights into the reconciliation process, making it easier to track and debug deployments compared to both alternatives. It exposes metrics that can be exported to Prometheus and visualized in Grafana for improved observability and alerting.
1. Broader Ecosystem: As part of the CNCF and with GitLab's support, Flux has a larger community and ecosystem, potentially leading to more integrations and support than helmfile or Tanka.

# Goals

Improve Kubernetes Workload management and deployment

What we will do: Simplify Kubernetes Workload management and deployment with a GitOps approach

How we will do it: Replace the usage of helmfile and tanka with Flux.

What impact it will have:

- Long-term benefits in terms of simplification, automation, security, and alignment with GitOps principles.
- A standardized deployment mechanism will realign our environments, improving stability and maintainability and reducing the cognitive load on engineers at GitLab.
- Will greatly reduce deployment time, automatically track and deploy changes, and prevent drift without requiring manual intervention from SREs.
- Easier multi-tenancy setup that will allow us to delegate control to stage teams to onboard their applications, while using best practices and standardized deployment mechanisms.
- Paves the way to integrate with Platform tooling like Crossplane or Kratix.

## Acceptance Criteria:

- [ ] Upgrade FluxCD to the latest version.
- [ ] Revisit and simplify FluxCD [k8s-mgmt](https://gitlab.com/gitlab-com/gl-infra/k8s-mgmt) repo structure.
- [ ] Implement mechanism to get Helm Release diffs in CI.
- [ ] Implement UI to visualize FluxCD, e.g. [Capacitor](https://github.com/gimlet-io/capacitor).
- [ ] Add [integration testing](https://github.com/fluxcd/flux2-multi-tenancy?tab=readme-ov-file#testing) to FluxCD https://gitlab.com/gitlab-com/gl-infra/k8s-mgmt repos.
- [ ] Complete migration from gitlab-helmfiles to FluxCD for partially migrated services like cert-manager and external-dns.
- [ ] Write Production Readiness for FluxCD.
- [ ] Bootstrap FluxCD in Production clusters.
- [ ] Deploy Foundations-owned services using FluxCD instead of helmfile.

### Epic Labels

Apply the following labels to all issues under this Epic:

```
# /label ~"team::Foundations" ~"Production Engineering::P3" ~"Foundations::Project work"
```
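
### Example: explicit release dependencies and drift correction

As a rough illustration of two points raised in the Context section (the implicit inter-release dependencies in gitlab-helmfiles and the lack of constant reconciliation), the sketch below shows how they might be expressed with Flux `HelmRelease` objects. Everything specific here is an assumption made for illustration and not taken from the k8s-mgmt repo: the `example-app` release, the `example-charts` repository, the namespaces, the intervals, and the retry count are hypothetical. The `helm.toolkit.fluxcd.io/v2` API version also assumes a recent Flux release; older installations may need a beta version of the API.

```yaml
# Illustrative sketch only: names, namespaces, intervals and values are hypothetical.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: jetstack
  namespace: flux-system
spec:
  interval: 1h                  # how often the chart index is refreshed
  url: https://charts.jetstack.io
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 10m                 # reconciliation runs continuously, not only from CI
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack
        namespace: flux-system
  driftDetection:
    mode: enabled               # detect and correct manual changes to managed objects
  upgrade:
    remediation:
      retries: 3                # retry failed upgrades before giving up
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app             # hypothetical application release
  namespace: example-app
spec:
  interval: 10m
  dependsOn:                    # explicit ordering instead of implicit dependencies
    - name: cert-manager
      namespace: cert-manager
  chart:
    spec:
      chart: example-app        # hypothetical chart name
      sourceRef:
        kind: HelmRepository
        name: example-charts    # hypothetical repository, not defined in this sketch
        namespace: flux-system
  values:
    replicaCount: 2
```

With `dependsOn`, a test cluster would only need the releases an application actually declares, which is the kind of explicit ordering the current helmfile setup lacks. This is a sketch under the stated assumptions, not a proposed layout for the migration.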
epic