# Discussion: Pipeline locking scenarios for `gitlab-com`; what's the best level of locking?
## Current state
Currently, the `k8s-workloads/gitlab-com` repository serves three types of pipelines:
- Auto-deploy pipelines against a specific environment
- Pipelines to roll out a change to one or more environments due to a change in the repo
- Pipelines running on a branch/merge request to determine what changes the MR would make against the running state
Inside each pipeline, there are three steps (a sketch follows the table below):
- Doing a "diff" against the code in the branch (master for auto-deploy and repo-change pipelines) and reporting, on the issue, which Kubernetes objects will change as a result of the pipeline
- Applying the Kubernetes manifests (does not happen in MR diff-only pipelines)
- Running QA tests (does not happen in auto-deploy pipelines, as the auto-deploy pipeline itself runs QA)
| Pipeline | Number of environments deployed to | Diff against running state? | Apply/change running state? | QA in pipeline? |
|---|---|---|---|---|
| Auto-deploy | 1 | yes | yes | no |
| Repo change | 1-3 | yes | yes | yes |
| MR diff | 0 | yes | no | no |
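To make the pipeline shape concrete, here is a minimal `.gitlab-ci.yml` sketch of the three steps for a single environment. The job names, scripts, and the `ENVIRONMENT`/`PIPELINE_TYPE` variables are illustrative assumptions, not the repository's actual configuration.

```yaml
# Illustrative sketch only; job names, scripts, and variables are hypothetical.
stages: [diff, apply, qa]

diff:
  stage: diff
  script: ./bin/k-ctl diff "$ENVIRONMENT"    # report which Kubernetes objects would change

apply:
  stage: apply
  script: ./bin/k-ctl apply "$ENVIRONMENT"
  rules:
    - if: $CI_PIPELINE_SOURCE != "merge_request_event"   # no apply in MR diff-only pipelines

qa:
  stage: qa
  script: ./bin/run-qa "$ENVIRONMENT"
  rules:
    - if: $PIPELINE_TYPE == "repo-change"    # auto-deploy pipelines run QA themselves
```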
Currently we do have locking, but only at the environment level for each individual job. This means that pipelines are able to run simultaneously, and the following scenarios can happen (see the sketch after this list):
- An auto-deploy to an environment and a configuration change to another environment can take place very close together (seconds apart) and any QA from either process will include both changes
- An auto-deploy to an environment and a configuration change to an environment can each run a QA pipeline, and these can clash (Quality has confirmed that multiple instances of the QA pipeline should not run at the same time)
- A configuration change to all environments (in a single pipeline) gets held up by a deploy to staging, which means the configuration change gets delayed going to production
- A diff job on an MR shows incorrect information because of scenario 3 (merged but unapplied changes due to pipeline execution being delayed)
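The current behaviour maps onto GitLab's `resource_group` keyword applied per job: each job serialises on its environment, but nothing prevents jobs from two different pipelines interleaving between each other's jobs. A minimal sketch, reusing the hypothetical job names from above:

```yaml
# Current approach (sketch): each job takes and releases the environment lock
# on its own, so another pipeline's jobs can run in the gaps between these jobs.
diff:
  stage: diff
  resource_group: $ENVIRONMENT
  script: ./bin/k-ctl diff "$ENVIRONMENT"

apply:
  stage: apply
  resource_group: $ENVIRONMENT
  script: ./bin/k-ctl apply "$ENVIRONMENT"
  # the lock is released as soon as this job finishes; another pipeline's
  # apply can run here, before our qa job starts

qa:
  stage: qa
  resource_group: $ENVIRONMENT
  script: ./bin/run-qa "$ENVIRONMENT"
```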
## Options for locking
### Entire pipeline lock
In this scenario, we use a very simple lock where only one pipeline can ever run in the `gitlab-com` repository on ops.gitlab.net. Overall this gives us the clearest isolation and will solve all of the issues above. However, it would also mean the following (a sketch of this approach follows the list):
- You cannot make a configuration change or deploy to staging while a deploy to production is ongoing (which can take 35 minutes or more); likewise, no changes to any other environment can happen while an auto-deploy to any environment is ongoing
- If someone is rolling out a configuration change to all environments, auto-deploys will be blocked behind this change for an hour or more, because rolling out to production and running QA for pre and gstg can take that long
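One way to implement an entire-pipeline lock with existing GitLab features is a single trigger job that holds a shared resource group while a child pipeline runs; with `strategy: depend`, the trigger job (and therefore the lock) stays alive until the child pipeline finishes. A minimal sketch, with a hypothetical child pipeline file:

```yaml
# Sketch: one global lock for the whole repository.
run-pipeline:
  stage: deploy
  resource_group: gitlab-com-global   # only one such trigger job runs at a time, repo-wide
  trigger:
    include: child-pipeline.yml       # hypothetical child pipeline containing all diff/apply/qa jobs
    strategy: depend                  # hold the lock until the child pipeline completes
```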
### Pipeline locking per environment
This is similar to what we currently do, but instead of taking the lock per job (allowing jobs to interleave), we would make the lock span the entire set of jobs for a specific environment (diff/apply). For auto-deploy pipelines this effectively becomes an "entire pipeline lock", as those pipelines are smaller.
This doesn't give quite as much isolation as an entire pipeline lock, but it does allow different actions to be taken against different environments. The following examples would be true (see the sketch after this list):
- A change rollout across multiple environments could be interleaved with an auto-deploy, but not in the same environment. This could still delay the change pipeline from completing until the auto-deploy has finished
- As the lock would also extend over QA for the change, we should be able to accurately determine whether a QA failure is related to the change (flaky tests and other issues notwithstanding)
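This option maps naturally onto one trigger job per environment, each holding that environment's resource group for the full diff/apply/QA sequence of a child pipeline. A sketch under the same assumptions as above, with hypothetical per-environment child pipeline files:

```yaml
# Sketch: per-environment locks covering the whole diff/apply/qa sequence.
gstg:
  stage: deploy
  resource_group: gstg            # serialises all work against gstg...
  trigger:
    include: pipelines/gstg.yml   # hypothetical child pipeline: diff -> apply -> qa for gstg
    strategy: depend              # ...for the full duration of the child pipeline

gprd:
  stage: deploy
  resource_group: gprd            # gstg and gprd work can still run in parallel
  trigger:
    include: pipelines/gprd.yml
    strategy: depend
```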
## Do we automatically "unlock" on pipeline completion?
Consider the scenario where a pipeline fails (for example, a configuration change is rolled out to pre and staging, and it fails in pre). We have two options:
- The lock is held only for the duration of the pipeline, and we release it even when the pipeline fails. The next pipeline will then try to apply the breaking change to the environment again and fail again, which leads to confusion (and, in the case of auto-deploys, breaks them)
- We don't automatically unlock on pipeline failure, meaning all other pipelines that interact with that environment "back up" in a pending state until the failed pipeline is fixed, or until someone manually unlocks the environment so pipelines can continue
While always unlocking automatically might be the easiest option, the fact that with the current setup every subsequent pipeline will continue to apply the bad change (due to the declarative nature of Kubernetes) means that all we end up doing is breaking every later pipeline anyway. At least with a manual unlock, people need to determine why the environment is still locked (the failure) and fix it appropriately before things continue.
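Note that GitLab's `resource_group` always releases when the holding job or pipeline finishes, so keeping an environment locked after a failure would require an explicit lock that we manage ourselves. A sketch of that shape, with an entirely hypothetical `lock` helper backed by some external store:

```yaml
# Sketch: an explicit lock that is only released automatically on success.
stages: [lock, apply, unlock]

acquire-lock:
  stage: lock
  script: ./bin/lock acquire "$ENVIRONMENT"   # hypothetical helper; waits until the lock is free

apply:
  stage: apply
  script: ./bin/k-ctl apply "$ENVIRONMENT"

release-lock:
  stage: unlock
  when: on_success                            # auto-unlock only when everything passed
  script: ./bin/lock release "$ENVIRONMENT"

force-unlock:
  stage: unlock
  when: manual                                # after a failure, a human investigates and unlocks
  script: ./bin/lock release "$ENVIRONMENT"
```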
## If we do pipeline locking per environment, what's the worst case for an environment to be locked?
To answer this question, we have to look at the longest time each environment can be locked by each type of pipeline. Visualising the locks in the diagram below shows that, without any additional changes, we do have the potential for gprd to be locked for a large amount of time in some circumstances.
## Points for discussion
- Can we think of any other scenarios where the `gitlab-com` repository is used that haven't been covered here?
- If we move to environment locking across a whole pipeline (or across deploying a change and QA testing that change), are we OK with the potential for certain environments to be locked longer than others?
- Do we need to consider re-arranging the pipeline itself as well as introducing locking, in particular for the pipelines that roll out a change (which take the longest)? Could this be a later iteration?
## Proposed solution
In order to address the problems we have with the current setup, as well as to continue to dogfood and improve upon GitLab features, we will implement a solution with the following (a sketch follows the list):
- We will leverage the GitLab Environments and Deployments features
- We will use the Environments Dashboard on ops.gitlab.net (here) to be the canonical source of `gitlab-com` environments, and potentially later for other k8s-workloads repos as well
- We will use Protected environments to ensure that only specific users/roles can deploy to specific environments. This will pave the way for using deployment approvals
- We will use Resource groups per environment to ensure that only one pipeline is touching a given environment at a time
- We will use Dynamic child pipelines, not only to provide consistent locking over an entire child pipeline, but also to make the jobs generated for a change more dynamic
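Putting those pieces together, a hedged sketch of what the top-level pipeline could look like: a generator job writes a child pipeline per environment, and a trigger job per environment holds that environment's resource group while its generated child pipeline runs. All file and script names below are illustrative:

```yaml
# Sketch: dynamic child pipelines combined with per-environment resource groups.
stages: [generate, deploy]

generate-pipelines:
  stage: generate
  script: ./bin/generate-child-pipeline gstg > gstg-pipeline.yml   # hypothetical generator
  artifacts:
    paths: [gstg-pipeline.yml]

gstg:
  stage: deploy
  resource_group: gstg                # only one pipeline touches gstg at a time
  trigger:
    include:
      - artifact: gstg-pipeline.yml   # the dynamically generated child pipeline
        job: generate-pipelines
    strategy: depend                  # hold the lock for the entire child pipeline
  # environment/deployment tracking would be declared on jobs inside the generated child pipeline
```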