# gstg-us-east1-b Zonal cluster rebuild

Production Change
## Change Summary
As part of &776 (closed), we aim to rebuild a cluster on gstg in place (delivery#2329 (closed)). This CR covers the steps for tearing down the cluster by removing its workloads, then deleting resources via Terraform while keeping the CI configs. Alerts that may fire will be collected. After the teardown, we will bring up a replacement cluster with Terraform and then deploy the workloads following each repo's method as described in its README.
## Change Details
- Services Impacted - gstg-us-east1-b
- Change Technician - @ahyield
- Change Reviewer - @skarbek
- Time tracking - ~5 h
- Downtime Component - gstg-us-east1-b
## Detailed steps for the change

### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - ~300 mins

- [ ] Communicate this maintenance to the SRE on-call and the release managers (RM).
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
#### CI config change for cluster gstg-us-east1-b (skip the cluster)

- [ ] Add the `CLUSTER_SKIP` variable with value `gstg-us-east1-b`. This needs to be set on our ops instance, where the pipelines run: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/settings/ci_cd
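If you prefer the API over the UI, a hypothetical equivalent is sketched below; `OPS_TOKEN` is an assumed variable holding a token with Maintainer access on the ops project.

```shell
# Sketch: create the CI/CD variable via the GitLab API on the ops instance.
curl --request POST --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com/variables" \
  --form "key=CLUSTER_SKIP" --form "value=gstg-us-east1-b"
```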
#### Remove traffic from canary

We do this so we don't over-saturate canary (cny) when the gstg zonal cluster goes down; canary does not have the same capacity as the zonal clusters.
- [ ] Fetch all canary/cny backends:

  ```shell
  ./bin/get-server-state -z b gstg | grep -E 'cny|canary'
  ```

- [ ] Set all canary/cny backends to `maint`:

  ```shell
  declare -a CNY=($(./bin/get-server-state -z b gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

  # Iterate over the whole array, not just its first element.
  for server in "${CNY[@]}"
  do
    ./bin/set-server-state -f -z b -s 60 gstg maint "$server"
  done
  ```
- [ ] Fetch all canary/cny backends again to validate the state; they should all be in MAINT:

  ```shell
  ./bin/get-server-state -z b gstg | grep -E 'cny|canary'
  ```

- [ ] Document in the comment section of this CR that traffic is cut from CNY.
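A stricter check than eyeballing the list, sketched under the assumption that each output line includes the backend's state (e.g. MAINT):

```shell
# Any output here is a backend that has not reached MAINT yet; silence is success.
./bin/get-server-state -z b gstg | grep -E 'cny|canary' | grep -v 'MAINT'
```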
#### Remove traffic from cluster gstg-us-east1-b
- [ ] Set all main backends in zone b to `maint`:

  ```shell
  ./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }'

  declare -a MAIN=($(./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

  # Iterate over the whole array, not just its first element.
  for server in "${MAIN[@]}"
  do
    ./bin/set-server-state -f -z b -s 60 gstg maint "$server"
  done
  ```
- [ ] Fetch all main backends again to validate the state; they should all be in MAINT:

  ```shell
  ./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b'
  ```

- [ ] Document in the comment section of this CR that traffic is cut from cluster b.
#### Replace the cluster using Terraform
- [ ] Log in to Vault from a different terminal window: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/vault/usage.md#access
- [ ] Export the needed Vault configs:

  ```shell
  export VAULT_ADDR=https://vault.ops.gke.gitlab.net
  export VAULT_PROXY_ADDR=socks5://localhost:18200
  ```
- [ ] `git pull` the latest changes in the config-mgmt directory.
- [ ] Replace the cluster:

  ```shell
  tf apply -replace="module.gke-us-east1-b.google_container_cluster.cluster"
  ```
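Optionally, review the blast radius first; `-replace` also works with `plan` (a sketch, assuming `tf` wraps plain `terraform`):

```shell
# Show exactly what would be destroyed and recreated before committing to it.
tf plan -replace="module.gke-us-east1-b.google_container_cluster.cluster"
```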
- [ ] Run `glsh setup` to fetch the new cluster configs.
- [ ] Document in the comment section of this CR that the cluster is created and has no workloads.
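One way to confirm the "no workloads" state (a sketch; expect only kube-system and other GKE system pods on a fresh cluster):

```shell
glsh kube use-cluster gstg-us-east1-b
kubectl get pods --all-namespaces
```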
#### Deploy Workloads to the cluster
- [ ] Validate access to the cluster via kubectl: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/kube/k8s-oncall-setup.md#kubernetes-api-access
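A minimal sanity check for that access (sketch):

```shell
kubectl cluster-info        # the API server should answer
kubectl get nodes -o wide   # every node should report STATUS Ready
```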
- [ ] From the gitlab-secrets directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply` (optionally preview first; see the diff sketch after this list).
- [ ] From the gitlab-helmfiles directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply`
- [ ] From the tanka-deployments directory, pull the latest changes, then run `tk apply --name gstg-us-east1-b environments/fluentd-elasticsearch`
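The preview mentioned above, as a sketch (helmfile's `diff` subcommand relies on the helm-diff plugin being installed):

```shell
# Show what the apply would change without touching the cluster.
helmfile -f helmfile.yaml -e gstg-us-east1-b diff
```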
- [ ] Monitor the workloads for the other namespaces being created; for example, the monitoring namespace should have running workloads such as Prometheus.
- [ ] From the gitlab-com directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply`
- [ ] Monitor the workloads in the gitlab namespace being created; we should see GitLab's components running (a watch sketch follows this list).
- [ ] Check the dashboard and make sure workloads are deployed: #7598 (closed)
- [ ] Check the workloads in the GCP console; all should be green and working.
- [ ] Document in the comment section of this CR that the cluster now has workloads.
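To follow the rollout without polling by hand (sketch):

```shell
# Watch GitLab components come up; interrupt with Ctrl-C once everything is Running.
kubectl get pods -n gitlab -w
```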
#### Make sure the cluster is up to date with the latest changes on the other cluster
- [ ] `glsh kube use-cluster gstg-us-east1-b`
- [ ] In a separate window: `kubectl get pod PODNAME -n gitlab -o jsonpath="{.spec.containers[0].image}"; echo`
- [ ] `glsh kube use-cluster gstg-us-east1-c`
- [ ] In a separate window: `kubectl get pod PODNAME -n gitlab -o jsonpath="{.spec.containers[0].image}"; echo`
- [ ] Verify that the version from cluster c matches the version from cluster b (see the comparison sketch below).
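If spot-checking a single pod per cluster feels error-prone, here is a broader comparison, sketched on the assumptions that the pods of interest live in the gitlab namespace and that `glsh kube use-cluster` switches the current kubectl context in place:

```shell
# Dump the first-container image of every pod in the gitlab namespace, per
# cluster, then diff the two sets. No diff output means the versions match.
for cluster in gstg-us-east1-b gstg-us-east1-c; do
  glsh kube use-cluster "$cluster"
  kubectl get pods -n gitlab \
    -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}' \
    | sort -u > "/tmp/images-${cluster}.txt"
done
diff /tmp/images-gstg-us-east1-b.txt /tmp/images-gstg-us-east1-c.txt
```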
#### Add traffic back to the cluster

We start with the main backends of gstg-b:
```shell
declare -a MAIN=($(./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

# Iterate over the whole array, not just its first element.
for server in "${MAIN[@]}"
do
  ./bin/set-server-state -f -z b -s 60 gstg ready "$server"
done
```

- [ ] Document in the comment section of this CR that the cluster is now accepting traffic.
Then cny:

```shell
declare -a CNY=($(./bin/get-server-state -z b gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

# Iterate over the whole array, not just its first element.
for server in "${CNY[@]}"
do
  ./bin/set-server-state -f -z b -s 60 gstg ready "$server"
done
```

- [ ] Document in the comment section of this CR that CNY is now accepting traffic.
- [ ] Remove the `CLUSTER_SKIP` variable with value `gstg-us-east1-b`. This needs to be removed from our ops instance, where the pipelines run: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/settings/ci_cd
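Mirroring the earlier API sketch (again hypothetical; `OPS_TOKEN` is an assumed token with Maintainer access):

```shell
# Sketch: delete the CI/CD variable via the GitLab API on the ops instance.
curl --request DELETE --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com/variables/CLUSTER_SKIP"
```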
- [ ] Set label ~change::complete: `/label ~change::complete`
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

There are no dedicated rollback steps for this change: since the cluster is being replaced in place, rolling back amounts to re-running the whole CR.
## Monitoring

### Key metrics to observe

- In this dashboard:
  - we should see the number of pods and containers drop to 0 when we destroy the cluster;
  - the numbers should climb back up when we reconstruct it.
- We should also monitor the namespaces: `kubectl get pods --all-namespaces`
- Troubleshoot any workloads that fail to run (a filter sketch follows this list).
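A quick filter for that troubleshooting step (sketch; Completed pods left behind by Jobs are expected and harmless):

```shell
# Surface anything unhealthy; empty output means all pods are Running or Completed.
kubectl get pods --all-namespaces --no-headers | grep -vE 'Running|Completed'
```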
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.