Investigate and document sandbox upgrade process on the Instrumentor level
(This issue was rescoped to focus the investigative effort mainly on the Instrumentor level of the sandbox upgrade process, in order to separate out a more thorough investigation into GET in Investigate GET deployments for Cloud Native Hy... (#19595 - closed) -- context on #19557 (comment 1531980408))
Problem Statement
Currently, upgrading the Dedicated tenant services requires scheduled maintenance windows, because the upgrade process incurs downtime. Referring to the dedicated architecture, Instrumentor provides Amp with definitions and logic for provisioning, configuring, and decommissioning tenants (EKS cluster and VMs).
The goal of this issue is to investigate and document the upgrade process of dedicated sandbox instance(s) at the Instrumentor level, in order to understand the cause of downtime.
We want to document the answers to these questions:
- How do we start a sandbox and run an upgrade?
- During an upgrade, how/where/when does Instrumentor interact with GET?
- How does Amp interact with Instrumentor?
- How does the maintenance window work?
- How does QA work with Instrumentor? Is it run during or after the maintenance window?
Update: these questions have been answered in the Delivery guide to dedicated documentation.
Out of Scope
- Specifics of GET's functionality, as those will be investigated more thoroughly as part of Investigate GET deployments for Cloud Native Hy... (#19595 - closed). This issue should only address GET's interaction with Instrumentor.
Context/Examples
We have some onboarding material and context on the dedicated architecture compiled in the delivery guide to dedicated documentation.
The following is some context from problem solving that the Delivery group has done in the past to eliminate downtime from auto-deploy clusters, which may be helpful as a reference:
- We have set up the nginx ingress load balancer to do readiness checks: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/master/releases/gitlab-extras/values.yaml.gotmpl?ref_type=heads#L53-95.
- There are more nginx configurations like this one https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/master/releases/gitlab-extras/values.yaml.gotmpl?ref_type=heads#L21, which sets `nginx.ingress.kubernetes.io/service-upstream: "true"`. This is a "safety mechanism to assist with leveraging Kubernetes to prevent traffic from potentially being sent to pods that may be transitioning to a terminated state" (the documentation explicitly mentions that it "can be desirable for things like zero-downtime deployments").
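For illustration only (not the actual gitlab-com configuration, which lives in the linked values file), the `service-upstream` annotation is applied on an Ingress resource roughly like this; the resource and host names here are hypothetical:

```yaml
# Hypothetical Ingress sketch showing the service-upstream annotation.
# With service-upstream enabled, ingress-nginx proxies to the Service's
# ClusterIP rather than to individual pod endpoints, so Kubernetes'
# endpoint handling decides which pods receive traffic during rollouts,
# helping avoid requests landing on terminating pods.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app            # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: example.internal   # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```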
There is also a past thread, gitlab-org/gitlab!96163 (comment 1142318833), with more context and discussion on how to reduce or eliminate the downtime.
Exit Criteria
- Answers to the questions asked in the Problem Statement section are documented (as part of "Delivery Guide to Dedicated" or a new nearby location): gitlab-org/release/docs!634 (merged)
- Further issues are created to address problems and further investigation.