Review Flux fit for critical workloads
This issue reviews our progress toward a working Kubernetes GitOps solution: where we are, and a path forward.
Summary
Starting in Q2 (~6 months ago), we decided to use Flux as the tool to address our Kubernetes GitOps requirements. Currently, 10 components/services are managed by Flux across 3 environments (pre, gstg, ops) and 9 GKE clusters.
While bootstrapping Flux in the gstg and ops environments, we faced challenges/limitations that are making us question the approach. At this point, we don't feel it would be safe or responsible of us to progress any further in adopting Flux as the GitOps solution for managing Kubernetes manifests in critical environments.
Looking back at the initial goals of the work, we are aiming to:
- Reduce time/effort to make changes
- Reduce blast radius and improve visibility
- Support auto-reconciliation and drift detection of manifest state
- Manage cluster Custom Resource Definitions (CRDs)
Over the last two quarters, we only managed to cover the last point (managing CRDs); progress has been slow due to the limitations listed below.
We are re-assessing the tooling choice as a short-term solution to bridge these product limitations until we are ready to fully adopt Flux.
Demo/video walkthrough of how Flux is currently set up.
Path forward
With the existing Flux state, we can't guarantee reliability or safety around Kubernetes changes due to lack of visibility. Flux can continue to work for non-production environments while the product improves.
Safety gaps in Helmfiles are still a major concern and have led to various incidents; velocity continues to decline as more workloads are added to environments.
Possible options from here:
- Do nothing, stay with Helmfiles
As more services are added to clusters, the blast radius will continue to increase, MRs will slow down further due to slow CI jobs, and users will be more tempted to diff and apply changes locally, increasing the probability of incidents and state drift.
This is a solve-nothing-and-wait approach. Looking upstream, there doesn't seem to be any active work in these areas, and addressing them will likely require investment from the GitLab side. Some of the limitations are design choices whose fixes would involve refactoring core components.
- Revisit Argo as a solution
During previous discussions around GitOps tooling, a demo of Argo was set up. The demo covered working examples of our workflows, fully addressing the limitations highlighted here.
Argo provides a good low-effort, short-term solution that bridges our immediate needs/requirements until Flux matures.
With the shift to a Cells architecture, most of this GitOps work might not be useful going forward for .com; opting for a low-effort solution will free resources that are better invested elsewhere.
Flux shortcomings
1. Bootstrapping
Deploying Flux to a cluster is mostly driven by the CLI. We need this automated, so we use the Terraform provider, which bundles the tooling as Go dependencies.
"Bootstrap" here is a misleading term: the Flux CLI actually manages the full lifecycle of the agent/service in the cluster. Any modification to the Flux system has to go through this method, including updates.
- Diffs are unreadable due to how state is exposed to Terraform; no workaround is possible
- Only one agent/cluster can be modified per MR, otherwise the repo fails to sync due to conflicts
After the initial bootstrap the manifests live in the clusters repo, but we can't control them directly in an easy and maintainable way, effectively splitting management of the cluster between Terraform and the manifest repos.
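For context, here is a minimal sketch of the sync objects that `flux bootstrap` commits to the cluster repo and then owns. The repo URL, paths, and API versions are illustrative and may differ per Flux version:

```yaml
# Illustrative flux-system sync objects (normally generated by flux bootstrap)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: ssh://git@gitlab.com/example/clusters.git   # hypothetical repo URL
  ref:
    branch: main
  secretRef:
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/example-cluster   # hypothetical cluster path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Because these objects are generated and reconciled by the bootstrap tooling, hand-editing them in the repo is not a supported workflow, which is what splits cluster ownership between Terraform and the manifest repos.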
2. Helm and manifest diffing
We leverage Helm charts heavily to manage our workloads. Values files (which configure a Helm chart) are customized at different layers and inherited in the following order:
- Chart default values
- Global values
- Environment values
- Cluster values
Each layer merges with the previous one, greatly simplifying service configuration across multiple clusters, as it is easy to set the appropriate configuration for a whole environment without repetition.
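To make the inheritance concrete, here is a hypothetical example of the four layers for a chart exposing `replicaCount` and `logLevel` values (keys are illustrative; later layers override earlier ones):

```yaml
# Layer 1: chart default values (values.yaml shipped with the chart)
replicaCount: 1
logLevel: info
---
# Layer 2: global values (all environments/clusters)
logLevel: warn
---
# Layer 3: environment values (e.g. gstg)
replicaCount: 3
---
# Layer 4: cluster values (single cluster override)
logLevel: debug
# Effective merged result: replicaCount: 3, logLevel: debug
```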
- Manifests for a new Helm deployment can't be generated, so we can't see what is actually going to be deployed
- Unable to diff a Helm release with values across different Kustomizations
- `valuesFrom` uses ConfigMaps; with multiple clusters, multiple Kustomizations are required
- `valuesFiles` can't be templated and require everything to be in a single repo
- Knowing which Kustomization to diff is not easy: Flux needs multiple layers to carry configuration across clusters in a multi-cluster approach, making CI change scoping unattainable; in most cases an environment-wide or global diff will be necessary (similar to Helmfiles).
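As an illustration of the `valuesFrom` limitation, here is a sketch of a HelmRelease pulling layered values from ConfigMaps (all names are hypothetical, and the API version may differ per Flux release). Each referenced ConfigMap has to exist in the target cluster, so every cluster needs its own Kustomization to generate its set:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: external-dns
  namespace: external-dns
spec:
  interval: 10m
  chart:
    spec:
      chart: external-dns
      sourceRef:
        kind: HelmRepository
        name: external-dns
  valuesFrom:
    # One ConfigMap per configuration layer; each must be generated
    # into the target cluster by a cluster-specific Kustomization
    - kind: ConfigMap
      name: external-dns-values-global    # hypothetical
    - kind: ConfigMap
      name: external-dns-values-cluster   # hypothetical
```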
3. Visibility
Flux is designed to work within a single cluster; multi-cluster functionality is not available. This makes it very difficult to get an overview of what composes a service and where it is running.
Let's take a look at how we deploy a service in a cluster, going layer by layer through the repo structure and the various Flux components, using external-dns, one of the simplest services we have, as an example.
The diagram illustrates, in a simplified way, the general workflow for configuring and deploying a service with Flux. As we can see, there are quite a few dependencies between Kustomization layers, and the Helm release values are spread out over two repos (components and tenant).
Looking into a tenant (Reliability), we have at least two layers of configuration: the components folder, which sources the actual component from the components repo, and the overlays folder, which adds/overrides components on top of what the clusters repo holds.
├── components
│ ├── external-dns
│ │ ├── flux-components-external-dns.yaml
│ │ └── kustomization.yaml
├── kustomization.yaml
├── overlays
│ └── gke
│ ├── gitlab-pre
│ │ └── us-east1
│ │ ├── pre-gitlab-gke
│ │ │ ├── external-dns
│ │ │ │ ├── kustomization.yaml
│ │ │ │ └── values.yaml
│ │ │ ├── kustomization.yaml
│ │ └── pre-gl-gke-2
│ │ ├── external-dns
│ │ │ ├── kustomization.yaml
│ │ │ └── values.yaml
│ │ ├── kustomization.yaml
│ └── gitlab-staging-1
│ ├── us-east1
│ │ ├── gitlab-36dv2
│ │ │ ├── external-dns
│ │ │ │ ├── kustomization.yaml
│ │ │ │ └── values.yaml
│ │ │ ├── kustomization.yaml
│ │ └── ...
...
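To give a feel for what each leaf in this tree holds, here is a sketch of one per-cluster overlay `kustomization.yaml`. The contents are illustrative, assuming a configMapGenerator that feeds the HelmRelease's `valuesFrom`; the relative path and generator name are assumptions, not our actual layout:

```yaml
# e.g. overlays/gke/gitlab-pre/us-east1/pre-gitlab-gke/external-dns/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # pull in the shared component definition from higher in the tree
  - ../../../../../../components/external-dns
configMapGenerator:
  # cluster-specific values, consumed by the HelmRelease via valuesFrom
  - name: external-dns-values   # hypothetical name
    files:
      - values.yaml
    options:
      disableNameSuffixHash: true
```

Every cluster repeats this pattern, which is where the layer count and maintenance burden come from.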
The multitude of layers required to make configuration inheritable and customizable per cluster brings more complexity than we currently have with Helmfiles, and reduces visibility instead of improving it.
This amplifies the previous topic of diffing: adding or updating a service requires changes across at least two different repos, and we can't diff HelmReleases/Kustomizations with unpublished resources (e.g. values from the tenant repo) as targets.
