Config changes to Kubernetes can be rolled out in an unsafe manner
Configuration changes to the [gitlab-com] repo can be rolled out without anyone realizing it. Let's walk through the following real-life example:
- 17:13 - Merged - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1490 (merged) - a change targeting only preprod and staging
- 17:29 - Merged - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1242 - adds a new fileserver to our GitLab configuration
- 17:57 - Pipeline Begins Running on Production - https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/1006889 - this is where the file server change was added to production, despite NOT being part of the appropriate MR
- 18:02 - Auto-Deploy fails because it noticed config changes alongside the deployment: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/1006937
So here we have an MR that targeted only non-prod environments, but because our pipeline is not smart enough to know that, it went on to make "noop" changes to production. Unfortunately for us, helmfile still queries Chef at render time regardless of the fact that nothing is staged for production. This means that by the time the pipeline got around to running on production, the new file server had been added to our Kubernetes configuration in an uncontrolled manner. This also impacted auto-deploy.
It also made redundant the work of an Engineer who was following our documented procedure for this change: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1492 (closed), which was therefore closed as unactionable.
Luckily for us, no harm was done: auto-deploy was simply retried and succeeded, since the change in question was being pushed by another pipeline anyway. The bad part of all of this is that we are pushing configuration changes that impact the availability of our services in an uncontrolled manner, which increases the risk of not knowing when a change lands in production.
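The underlying gap is that the production apply can pick up Chef-driven drift that no MR staged. A guard along these lines could at least make that drift visible before applying. This is a minimal sketch, not what our pipeline actually does: the `gprd` environment name, the `PROD_PATHS` list, and the `CHANGED_FILES` variable are all assumptions for illustration, and `--detailed-exitcode` behaviour should be confirmed against the helmfile version we pin.

```python
#!/usr/bin/env python3
"""Sketch of a pre-apply guard for the production job.

Fails the job when `helmfile diff` reports pending production changes
that the triggering MR never touched, so Chef-driven drift cannot land
silently. Environment name, paths, and CHANGED_FILES are assumptions.
"""
import os
import subprocess
import sys

# Hypothetical paths an MR must touch for a production change to count as
# "intended"; the real layout of the gitlab-com repo will differ.
PROD_PATHS = ("releases/", "bases/environments.yaml")


def mr_touches_production() -> bool:
    # Hypothetical: an earlier CI step exports the MR's changed files as a
    # newline-separated CHANGED_FILES variable.
    changed = os.environ.get("CHANGED_FILES", "").splitlines()
    return any(path.startswith(PROD_PATHS) for path in changed)


def production_diff_pending() -> bool:
    # helmfile's diff (via helm-diff) exits with 2 when changes are pending;
    # confirm --detailed-exitcode is available in the pinned helmfile version.
    result = subprocess.run(
        ["helmfile", "-e", "gprd", "diff", "--detailed-exitcode"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 2


if __name__ == "__main__":
    if production_diff_pending() and not mr_touches_production():
        sys.exit(
            "Refusing to apply: a production diff exists, but this MR did "
            "not change any production configuration."
        )
```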