Deprecate/remove `ALLOW_K8S_FAILURE` from release-tools pipelines

Summary

In incident production#6400 (closed), the ALLOW_K8S_FAILURE flag was used to bypass a Kubernetes failure and keep deploying.

I suspect this flag was added when we first started doing deployments in Kubernetes, at a time when ignoring k8s failures to keep the pipelines moving made sense.

There is also a set of valid use cases where we might want to use it, such as when a GKE cluster in an environment is down and we wish to keep deploying to the other clusters. In this case the job for the down cluster will fail, while the others succeed.

We should replace the message that release-tools shows on pipeline failure with a link to a runbook that explains what to do when the k8s pipeline is failing. The runbook should also highlight the following:

In what specific circumstances the ALLOW_K8S_FAILURE flag could be used (when you know that, for an external reason, one or more of the jobs in the gitlab-com pipeline will fail).
What to watch out for when using the ALLOW_K8S_FAILURE flag (e.g. you are acknowledging that using this flag could mean the release is not deployed to any part of the environment at all). This should highlight manually checking the gitlab-com pipeline to ensure the jobs you expected to pass do indeed pass.
Strongly consider opening a merge request against the gitlab-com repo to change the pipeline setup (e.g. disabling jobs for a specific cluster that is down) instead of using the flag.
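
The manual check described above (confirm that only the jobs you expected to fail have failed, and that the release actually landed somewhere) could be sketched roughly as follows. This is only an illustration in Python: `JobResult`, `unexpected_failures`, `safe_to_allow_failure` and the cluster names are all hypothetical and do not correspond to anything in release-tools or the gitlab-com pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    """Outcome of one job in the gitlab-com pipeline (hypothetical shape)."""
    cluster: str
    passed: bool


def unexpected_failures(results, known_down_clusters):
    """Return the clusters whose jobs failed without a known external reason."""
    known_down = set(known_down_clusters)
    return sorted(
        {r.cluster for r in results if not r.passed and r.cluster not in known_down}
    )


def safe_to_allow_failure(results, known_down_clusters):
    """True only if every failure was expected AND at least one job passed.

    The second condition guards against the case called out above, where
    the release ends up deployed to no part of the environment at all.
    """
    any_passed = any(r.passed for r in results)
    return any_passed and not unexpected_failures(results, known_down_clusters)


# Example: one cluster is known to be down, the other deployed fine.
results = [JobResult("cluster-a", True), JobResult("cluster-b", False)]
print(safe_to_allow_failure(results, known_down_clusters=["cluster-b"]))
```

If the same results are checked with an empty list of known-down clusters, the failure on the second cluster counts as unexpected and the check refuses the override.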

Related Incident(s)

Originating issue(s): production#6400 (closed)

Desired Outcome/Acceptance criteria

Create runbook for what to do when the gitlab-com deployment pipeline fails
Update deployer job that triggers gitlab-com to change the error message to point to the runbook instead of directly calling out the ALLOW_K8S_FAILURE flag
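
As a rough sketch of the second item, the failure message could point to the runbook and leave the flag name out entirely. The snippet below is illustrative Python, not the deployer's actual code: the function name, the message wording and `RUNBOOK_URL` are assumptions, and the URL is a placeholder because the runbook does not exist yet.

```python
# Placeholder only: the real runbook location is still to be decided.
RUNBOOK_URL = "https://example.com/runbooks/gitlab-com-deployment-failure"


def k8s_failure_message(pipeline_url, runbook_url=RUNBOOK_URL):
    """Build the failure message shown when the gitlab-com pipeline fails.

    The message deliberately does not name the override flag; the runbook
    is where the circumstances and caveats for using it are explained.
    """
    return (
        f"The gitlab-com deployment pipeline failed: {pipeline_url}\n"
        f"Follow the runbook before retrying or overriding: {runbook_url}"
    )


print(k8s_failure_message("https://example.com/pipelines/123"))
```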

Associated Services

Release tools

Corrective Action Issue Checklist

link the incident(s) this corrective action arose out of
give context for what problem this corrective action is trying to prevent from re-occurring
assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
assign a priority (this will default to 'priority::4')

Edited Jul 21, 2022 by John Skarbek