Discussion: Issues with the current `k8s-workloads/gitlab-com` repository
This issue is to try to provide a single up-to-date place to describe all the issues we face with the current setup of the https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com repository. This includes issues related to its use in auto-deploy, as well as issues related to its use for rolling out configuration changes outside of auto-deploy (and often how these two things overlap). The issue will try to highlight specific issues as well as overall themes of problems encountered.
The goal here is to identify all the problems in one place so solutions can be proposed to improve things moving forward.
How `k8s-workloads/gitlab-com` works currently
Currently, the process through which the `helmfile` tool determines what the desired state of the Kubernetes manifests should be in a specific environment/cluster is by following the process outlined in the picture below.
This happens for every single environment/cluster combination, for every single pipeline that is run, regardless of its source (an MR merged into master, release-tools running a pipeline, or a user running a pipeline manually). Each of the boxes in blue is an external data source, which can change independently at any time, outside of the data stored in the git repo (`gitlab-com`) and outside of any upstream pipeline.
This means when attempting to determine if a change is safe and expected, we have to hope that none of the other blue boxes (external sources) have changed in the meantime. Also, when something does go wrong (like during an incident), we have to manually work through and determine which external data sources may have introduced changes unexpectedly (then search the history of those systems to find who changed what and why).
Finally, this process happens every time a pipeline is run, meaning that a pipeline run for a merge request (to determine if the change in the MR is safe) may give different results when that pipeline is merged and applied to all environments.
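To illustrate, here is a minimal sketch (the environment name and release selector are illustrative, not necessarily the ones used in this repo): running the same render twice can produce different results even with no commit in between, because external sources are consulted at execution time.

```bash
# Illustrative only: values are resolved from external sources (secrets,
# chef data, CI variables) at execution time, not from git alone.
helmfile --environment gprd --selector name=gitlab diff > render-1.txt

# ...time passes; an external source changes, with no commit to this repo...

helmfile --environment gprd --selector name=gitlab diff > render-2.txt
diff render-1.txt render-2.txt   # can show changes even though git is unchanged
```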
User experience problems
1. Diffs on merge requests can often show unrelated changes (changes that haven't been applied)
Because we rely on helm-diff combined with pulling in external data at runtime (secrets, chef data, etc.) to determine what your change/MR will do to our manifests, there is a chance, depending on timing, that unexpected changes are displayed, either due to unapplied changes or due to changes in systems outside your control. This can be frustrating to diagnose or understand, and leaves confusion as to the best way forward. It also means that you never really feel 100% confident that your change is only going to change what you expect.
Related issues:
2. Users outside of infrastructure are unable to see diff result output due to the potential for secrets to be leaked
Because the `gitlab-com` repository still has secrets involved, and also because we rely on pulling some information from chef (when we really shouldn't), we are unable to make the job logs that show the diff against the real infrastructure visible to users outside of infrastructure (due to the risk of secret data being leaked). This really hampers usability: users trying to contribute from the outside can see that things have changed, but have no real confidence that the change is what they expect, so they have to rely on a reviewer from infrastructure (which leads to a lot of back and forth).
Related issues:
3. Users outside of infrastructure are unable to get a simple picture of what Kubernetes objects we run where
If anyone without production access to our clusters wants to see a simple setting on any of our Kubernetes manifests (e.g. what is the liveness probe on gitlab-pages in canary?), they have to dig through the incredibly complicated gotmpl setup we have, both in its use for helmfile and helm values and then inside the GitLab Helm chart itself. This is a huge barrier to learning and understanding, and it stops users from being confident enough to make changes themselves (instead it pushes them to ask someone from infrastructure to do it).
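As a hypothetical sketch (environment name is illustrative), answering that question today roughly means rendering the whole chart locally and searching the output, and even that requires access to the same external data sources:

```bash
# Render everything for one environment, then search for the setting.
helmfile --environment gprd-cny template > rendered.yaml
grep -n -A 6 'livenessProbe' rendered.yaml
```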
CI Problems
4. When master breaks on the repository, we have no notifications that it happened, and due to long execution time, people forget to confirm the pipeline rollout is complete
It's often the case that a user rolls out a change to one or more environments and, because our CI pipelines take so long, doesn't sit there monitoring them the whole time (understandably). If a pipeline does break, the only notification is an email to the user saying the pipeline has failed. However, broken master pipelines are of critical importance, as they will break auto-deploys, leave the cluster state different from the state in git, and cause everyone else's MRs to show unintended changes.
Related issues:
5. Our CI pipelines are very bloated due to the fact we cannot make intelligent decisions around which jobs to include depending on the change
As we have data coming from external sources (chef, pipeline variables) and cannot accurately determine how changing one file might impact multiple environments, we default to running all our CI pipelines across all environments. This is an extremely inefficient approach that greatly increases CI execution time and complexity, and it can confuse users or needlessly stall rollouts (e.g. a QA job executes in staging despite nothing in staging being changed).
Related issues:
6. If a CI job is cancelled while helm is running, helm is left in an unclean state which needs to be manually cleaned up
As helm relies on storing a copy of its own state inside Kubernetes secret objects, it uses these objects as a "lock" of sorts to determine if an upgrade/downgrade is already in progress. When we roll out something with the gitlab-com repo, it uses helmfile and then helm underneath. When you cancel a CI job (because you no longer wish to continue rolling out that change), the termination of the helm process leaves helm in a bad state, where any subsequent pipeline or execution of helm gives the error `Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress`. This requires an SRE to manually intervene and reset the helm state to something valid so pipelines can continue. In such a case it also leaves ambiguity around what the actual state of the manifests running in the cluster is.
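For reference, the manual cleanup looks roughly like the sketch below (release name, namespace, and revision numbers are illustrative; the exact steps vary by situation):

```bash
# Helm 3 keeps release state in secrets named sh.helm.release.v1.<release>.v<N>,
# labelled with the release status; a cancelled job leaves the newest one
# stuck in "pending-upgrade".
kubectl -n gitlab get secrets -l owner=helm,name=gitlab \
  --sort-by=.metadata.creationTimestamp

# Either roll back to the last revision that actually deployed...
helm -n gitlab rollback gitlab 123

# ...or delete the stuck release secret so helm no longer believes an
# operation is in progress.
kubectl -n gitlab delete secret sh.helm.release.v1.gitlab.v124
```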
Related issues:
7. Helm's detection of problems and rollback is very slow (if it works at all), and often gets us into a state needing cleanup
We use the `--atomic` and `--wait` flags for helm, which means that helm execution itself will monitor and block until it determines that the rollout of the Kubernetes manifests is complete. However, helm's logic for determining whether a rollout is succeeding or failing is very simplistic, very opaque to us (we can't tell what's going on), and covers only a small set of supported Kubernetes resources, with no ability for us to add our own custom logic for determining whether a rollout is complete.
On top of this, when something does go wrong we don't actually fail quickly; rather, helm will wait a full 1 hour (our configured timeout) for the rollout to finish (it won't), and only after that 1 hour will it roll back to the previous state. This is an exceptionally long time to determine something has gone wrong, and rolling back this state leads to skew between the live state in the cluster and what is in git, further confusing things (and if users cancel the job prematurely, we get other problems, see "If a CI job is cancelled while helm is running, helm is left in an unclean state which needs to be manually cleaned up").
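Roughly, the underlying invocation is of the shape sketched below (release, chart, namespace, and values file are illustrative): with these flags, helm blocks until its own readiness checks pass, and on failure waits out the full timeout before rolling back.

```bash
# With --atomic and --wait, helm blocks on its own readiness checks and, on
# failure, only rolls back after the full --timeout has elapsed.
helm upgrade --install gitlab gitlab/gitlab \
  --namespace gitlab \
  --atomic --wait --timeout 1h \
  -f values.yaml
```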
8. Pipeline execution time is very long due to the slowness of helm/helmfile
Our pipelines are currently setup to perform a helmfile diff
and comment the output on the MR, and then do a helmfile apply
(which also does a helmfile diff
first before doing the apply). Each "diff/apply" operation takes approximately 3 minutes to complete (often longer), meaning we often have at least 6 minutes of deadtime just to apply a change (longer if execution is laggy). This is at the point where we have merged an MR and want to actually apply the change.
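As a sketch of where that dead time comes from (environment name is illustrative), the two steps the pipeline runs are:

```bash
# Step 1: diff, posted as a comment on the MR (~3 minutes or more).
time helmfile --environment gprd diff

# Step 2: apply, which internally runs another diff before syncing
# (~3 minutes or more again).
time helmfile --environment gprd apply
```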
Repository problems
9. We have no easily viewable central ledger of changes and what order they happened in
This is a larger issue than just this repo, but as this repo is of the highest importance, this problem is most present here. Changes deployed by this repo can happen in three ways:
- When you merge an MR and the pipeline runs on master
- When someone manually runs a pipeline on master, and data from external sources is changed/pulled in
- When auto-deploy runs a pipeline on master, passing in different variables to the pipeline
Because of this, there is no clear, concise history of what changes happened and by whom. The best you can do is look through the pipeline history on ops.gitlab.net, however even that does not give a clear picture because pipelines can interleave. The other option is looking at the events log index here, however it is hard to match that to state changes (you have to go through each pipeline and figure out where it came from).
10. Diff jobs showing actual changes on MRs can get stale, requiring constant rebasing in order to be up to date and be positive you are only changing what you expect
Because every merged MR runs a pipeline against all environments, potentially bringing in unexpected changes from external sources, in order to maximise safety and give an accurate review of an MR you have to run a pipeline "as close as possible" to review/merge time. This is so the running state you are examining, combined with your change in the MR, gives you the most accurate picture of what will happen. This causes issues with MRs that are open for long periods of time, which happens a lot due to GitLab's async work environment.
11. Unable to do any static analysis or other preemptive linting/validation of manifests in CI jobs
A smaller point is that as we have no way of easily getting the manifests for our Kubernetes deployments in an environment, we aren't able to leverage GitLab CI and static analysis tools (as well as GitLab product features) to provide us more proactive ways of determining problems.
Some examples of this are using kubeval to catch invalid Kubernetes manifests, or using pluto to detect Kubernetes manifests that use deprecated APIs.
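If rendered manifests were available as plain files in a CI job, checks like the sketch below would become straightforward (environment name, flags, and target version are illustrative):

```bash
# Render the manifests once, then run static checks against the output.
helmfile --environment gprd template > manifests.yaml

# Validate manifests against Kubernetes API schemas.
kubeval --strict manifests.yaml

# Flag manifests that use deprecated or removed APIs.
pluto detect manifests.yaml --target-versions k8s=v1.25.0
```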
Related Issues:
12. The tooling solution we have come up with to log output of a deployment (show us what's happening) is poor and sometimes doesn't work
This is a known issue due to the fact that the helm execution run by helmfile does not give us any logging output, and as such we are unable to get any information about the state of a deployment from helm itself. The best we can do is a workaround: running a bash script alongside the execution of helm that tries to detect when deployments have changed and track their progress. This bash script, however, is not perfect and often leaves us with no visibility at all.
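The workaround amounts to something like the sketch below (this is not the actual script; namespace and labels are illustrative):

```bash
# Poll rollout status for each deployment in the release in the background,
# since helm itself gives no progress output while it runs.
for d in $(kubectl -n gitlab get deployments -l release=gitlab -o name); do
  kubectl -n gitlab rollout status "$d" --timeout=60m &
done
wait
```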
Related issues:
13. In order to avoid a runtime dependency on GitLab.com, we have a manual band-aid job in place to block connections to GitLab.com, which is a burden
In order to avoid a situation where we were unable to deploy new code for GitLab.com because our deployment tooling depends on GitLab.com being available, we opened issue #227 (closed) and did work in gitlab-com/gl-infra/k8s-workloads/gitlab-com!744 (merged) to make our CI pipelines temporarily disable access to GitLab.com while running. The way this is done is quite brittle and error prone, not to mention it's a symptom of a larger problem: our deployment model with helm/helmfile relies on pulling data from external sources at runtime in order to work.
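One illustrative way such a block can be implemented (the actual job may well differ) is to point the hostname at an unroutable address for the duration of the CI run:

```bash
# Illustrative only: resolve gitlab.com to a blackhole address inside the CI
# job so any accidental runtime dependency on it fails fast.
echo '0.0.0.0 gitlab.com' >> /etc/hosts
```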
14. We do secrets management for GitLab inside this repo
We run into a lot of considerations, complications, and restrictions because we have to be careful with this repository, as it does secrets management for GitLab.com. This epic should alleviate some of these issues, however we need to push forward on gitlab-org&6060 and ensure that no configuration which is deemed potentially secret is done via environment variables anymore (a recent example where this was introduced was gitlab-com/gl-infra/k8s-workloads/gitlab-com!1615 (merged)).
15. We pull some data from chef when there is no reason to; it's just leveraging a quirk of our current setup for safety
We can see in the file values-from-external-sources.yaml.gotmpl that we have some environment variables for the Rails pods which actually contain secret data, despite the fact that we don't store them in Kubernetes secrets at all. Normally, we would be unable to have this data in a public repo at all; the only reason this works is the left-over technical debt where some values were pulled from chef. Now that chef is no longer being used for these Rails components, we still pull the data from chef unnecessarily. This can't be removed until the product fixes epic &672 (closed).
Related issues:
Auto Deploy problems
16. Changes outside of auto-deploys can block auto-deploys if left unapplied (e.g. due to a job in the apply pipeline failing)
Because our current pipelines are not blocking on failure (e.g. if one pipeline fails, other pipelines can still execute), combined with the fact that we have no automated rollback of the state in Git, when a pipeline fails on master, all other pipelines (either from merged MRs or auto-deploy) will continue to run and fail, due to the problem in the first pipeline. We also have no monitoring or alerting when a pipeline on master fails, meaning that outstanding changes which cause failures are often first noticed by release managers monitoring auto-deploy pipelines, leading to confusion and loss of productivity from having to determine where the failed changes came from, who made the change and why, and manually reverting the failed change.
17. Auto-deploy pipelines can actually interleave with configuration change pipelines on a per cluster basis
Because the only locking mechanism we have on our pipelines is around environments (using the GitLab environments feature), it's entirely possible for multiple pipelines applying changes to be running at the same time. They won't deploy a change to the same cluster at the same time, but as we have up to 4 clusters per environment, and applying a change can take minutes (or longer), it's very possible to have clashes where changes were applied in the middle of an auto-deploy, simply interleaved on a per-cluster basis, making it very difficult to determine exactly what changed when (and whether an issue was caused by the auto-deploy or by the configuration change).
Related issues:
18. The version of components that is run is passed to pipelines as a CI variable, meaning we have no concrete history in-repo of what version was upgraded when
Similar to "We have no easily viewable central ledger of changes and what order they happened in", if we consider the version of GitLab containers to run in each environment as a "configuration change" (as technically it is, we are just changing the value of the pod image rather than some other configuration setting), then we have no concrete tracking of which version of each container in each pod changed at which point in time. All that information is stored in CI environment variables, and it's extremely difficult to determine when those variables changed, by whom, and what pipeline initially applied that change (the best we can do is look at pipeline history and GitLab logs).
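A sketch of the shape of the problem (the variable and value names are hypothetical, not the actual ones used by release-tools): the applied version arrives as a pipeline variable, so nothing committed to this repo records when it changed.

```bash
# Hypothetical: the image tag is injected at apply time from a CI variable,
# so `git log` in this repo never shows when the version changed.
helmfile --environment gprd apply \
  --state-values-set "gitlab.image.tag=${GITLAB_IMAGE_TAG}"
```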
Requirements for solution
- There are no external sources of data consulted during pipeline execution time
- Any user should be able to easily determine the current state of the Kubernetes manifests in any environment
- Any user should be able to make a change and determine exactly what will change to the Kubernetes manifests for a specific environment, before submitting an MR
- The CI pipeline for any given change should run in complete isolation with no overlap
- The CI pipeline for any given change should only have the jobs needed to roll out that change in the environments it targets
- There is a single CI pipeline experience and expectation for all types of changes (either config changes or deployment version changes)
- There should be a single interface for any users or automated systems to make any changes to Kubernetes manifests for an environment, whether it's a configuration change, chart bump, or rollout of a new version of GitLab
- Similar to the point above, there should be a single unified place to see an accurate history of all changes to Kubernetes deployment manifests, and who changed them at what time (and in what order)