Create a new `gitlab-triage-reactive` project inside the `eng-productivity-57de876a` folder, and move the current infrastructure to it
Context
This item is part of the incident post-mortem: #1178 (comment 1201846600)
Triage-ops was down due to a cleanup error, which wiped out the GCP load balancer: #1178 (closed).
Goal
Discuss whether we should move triage-ops into its own GCP project.
Is the migration technically feasible?
We already use a separate GKE cluster for it, so it should be a matter of recreating a GKE cluster in a separate GCP project and deploying triage-ops and its third-party dependencies there via Terraform. Expect about 5-10 minutes of downtime while the load balancer is created if all goes well, possibly a bit more.
Possible migration path
I propose the following first migration draft:
- Create the new GCP project
- Create the new GKE cluster (via Terraform) in that new project
- Reserve a new static IP in GCP to use for the triage-ops domain
- Deploy the triage-ops dependencies (e.g. cert-manager, nginx-ingress-controller) via Terraform. Create new Terraform resources for this.
- Deploy triage-ops to the new cluster with the background jobs disabled, so that nothing is processed yet.
- Ensure that triage-ops is available (cert-manager won't work until the DNS is switched, so just try to access it via the IP for now)
- When we are ready, change the DNS record to point to the new static IP
- Ensure that the TLS certificate is correctly issued by cert-manager
- Ensure that triage-ops is accessible via HTTPS
- Enable the background workers
- Re-enable the webhooks (see triage-ops RUNBOOK for the steps)
- Delete the old cluster's resources via Terraform.
Another variant would be to keep the same static IP. Pros: easier to switch back and forth. Cons: we cannot test that triage-ops is accessible from the outside world.
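The verification steps in the draft above (reach triage-ops via the IP, confirm the DNS switch, confirm the certificate and HTTPS access) could be scripted roughly as follows. This is only a sketch using the Python standard library, not something that exists in triage-ops today. All function names are hypothetical, and it assumes the service answers plain HTTP on port 80 of the new static IP before the DNS switch, which depends on how the ingress is configured.

```python
import http.client
import socket
import ssl
from datetime import datetime, timezone


def http_status_via_ip(ip, hostname, path="/", timeout=10):
    """Request the service directly on the static IP, before the DNS switch.

    The Host header lets the ingress route the request to the right backend.
    Assumption: plain HTTP on port 80 is reachable at this stage.
    """
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().status
    finally:
        conn.close()


def resolved_ips(hostname, resolve=socket.getaddrinfo):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = resolve(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def dns_switched(hostname, new_ip, resolve=socket.getaddrinfo):
    """True once the DNS record points only at the new static IP."""
    return resolved_ips(hostname, resolve) == {new_ip}


def fetch_certificate(hostname, port=443, timeout=10):
    """Open a verified TLS connection and return the peer certificate.

    Raises ssl.SSLError if the certificate is untrusted or does not match
    the hostname, which is what we expect until cert-manager has issued it.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def days_until_expiry(cert, now=None):
    """Whole days left before the certificate's notAfter date."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def check_migration(hostname, new_ip):
    """Run the post-switch checks and return a short status message."""
    if not dns_switched(hostname, new_ip):
        return "DNS does not point at the new static IP yet"
    cert = fetch_certificate(hostname)
    return f"HTTPS OK, certificate valid for {days_until_expiry(cert)} more days"
```

Before the DNS switch, `http_status_via_ip(new_ip, hostname)` covers the "access it via the IP" step. After the switch, `check_migration(hostname, new_ip)` can be re-run until it reports a valid certificate, at which point the background workers and webhooks can be re-enabled.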