Deprecate/remove `ALLOW_K8S_FAILURE` from release-tools pipelines
Summary
In incident production#6400 (closed), the `ALLOW_K8S_FAILURE` flag was used to bypass the Kubernetes failure and keep deploying.
I suspect this flag was added when we first started doing deployments in Kubernetes, at a time when ignoring k8s failures to keep the pipelines moving made sense.
There is also a set of valid use cases for the flag, such as when a GKE cluster in an environment is down and we wish to keep deploying to the other clusters. In this case, the job for the down cluster will fail while the others succeed.
We should change the message that release tools shows on pipeline failure so that, instead of the current message, it links to a runbook with instructions on what to do when the k8s pipeline is failing. The runbook should also highlight the following:
- In what specific circumstances the `ALLOW_K8S_FAILURE` flag could be used (when you know that, due to an external reason, one or more of the jobs in the `gitlab-com` pipeline will fail).
- What to watch out for when using the `ALLOW_K8S_FAILURE` flag (e.g. you are acknowledging that using this flag could mean that the release is not deployed to any part of the environment at all). This should highlight manually checking the `gitlab-com` pipeline to ensure the jobs you expected to pass do indeed pass.
- Strongly consider opening a merge request to the `gitlab-com` repo that changes the pipeline setup (e.g. disabling jobs for a specific cluster that is down) instead of using the flag.
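The manual check described in the second point can be sketched roughly as follows. This is only an illustration: the function names, the job-name-to-status mapping, and the status strings are hypothetical and are not taken from release tools or the `gitlab-com` pipeline.

```python
def unexpected_failures(job_statuses, jobs_expected_to_fail):
    """Return the jobs that failed even though no failure was expected.

    job_statuses: hypothetical mapping of job name -> status string.
    jobs_expected_to_fail: jobs for clusters known to be down.
    """
    return sorted(
        job
        for job, status in job_statuses.items()
        if status != "success" and job not in jobs_expected_to_fail
    )


def deployed_anywhere(job_statuses):
    """True if at least one job succeeded, i.e. the release landed somewhere."""
    return any(status == "success" for status in job_statuses.values())
```

If `unexpected_failures` returns anything, or `deployed_anywhere` is false, the flag has hidden a failure that was not the one you intended to ignore.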
Related Incident(s)
Originating issue(s): production#6400 (closed)
Desired Outcome/Acceptance criteria
- Create a runbook for what to do when the `gitlab-com` deployment pipeline fails
- Update the `deployer` job that triggers `gitlab-com` to change the error message to point to the runbook instead of directly calling out the `ALLOW_K8S_FAILURE` flag
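As a rough sketch of the second criterion, the failure message could be built so that it links to the runbook and never names the flag. Everything here is an assumption for illustration: `RUNBOOK_URL` is a placeholder (the runbook does not exist yet), and `failure_message` is a hypothetical helper, not the actual code in the `deployer` job.

```python
# Placeholder only: the real runbook location is not defined in this issue.
RUNBOOK_URL = "https://example.com/runbooks/gitlab-com-pipeline-failure"


def failure_message(pipeline_url):
    """Build an error message that points to the runbook.

    The message deliberately does not mention any override flag, so the
    reader is sent to the runbook before deciding how to proceed.
    """
    return (
        f"The gitlab-com deployment pipeline failed: {pipeline_url}\n"
        f"See the runbook for next steps: {RUNBOOK_URL}"
    )
```

Keeping the flag name out of the message means the runbook becomes the single place that documents when bypassing the failure is acceptable.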
Associated Services
- Release tools
Corrective Action Issue Checklist
- link the incident(s) this corrective action arose out of
- give context for what problem this corrective action is trying to prevent from re-occurring
- assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- assign a priority (this will default to 'priority::4')