# gstg-us-east1-b Zonal cluster rebuild

Production Change
## Change Summary
As part of &776 (closed), we aim to rebuild a cluster on gstg in place (delivery#2329 (closed)). This CR covers the steps for tearing down the cluster by removing its workloads, then deleting resources via Terraform while keeping the CI configs. Alerts that may fire will be collected. After the teardown, we will bring up a replacement cluster with Terraform and then deploy the workloads following each repo's method as described in its README.
## Change Details
- Services Impacted - gstg-us-east1-b
- Change Technician - @ahyield
- Change Reviewer - @skarbek
- Time tracking - ~5 h
- Downtime Component - gstg-us-east1-b
## Detailed steps for the change

### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - ~300 mins

- [ ] Communicate this maintenance to the SRE on-call and the release managers (RM).
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
#### CI config change for cluster gstg-us-east1-b (skip the cluster)

- [ ] Add the `CLUSTER_SKIP` variable with value `gstg-us-east1-b`. This needs to be set on our ops instance, where the pipelines run: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/settings/ci_cd
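If you prefer the API over the UI, a hypothetical equivalent is sketched below; `OPS_TOKEN` is an assumed variable holding a token with Maintainer access on the ops project.

```shell
# Sketch: create the CI/CD variable via the GitLab API on the ops instance.
curl --request POST --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com/variables" \
  --form "key=CLUSTER_SKIP" --form "value=gstg-us-east1-b"
```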
#### Remove traffic from canary

We do this so we don't over-saturate canary (cny) when the gstg zonal cluster goes down; canary does not have the same capacity as the zonal clusters.
- [ ] Fetch all canary/cny backends:

  ```shell
  ./bin/get-server-state -z b gstg | grep -E 'cny|canary'
  ```

- [ ] Set all canary/cny backends to `maint`:

  ```shell
  declare -a CNY=($(./bin/get-server-state -z b gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

  # Iterate over the whole array, not just its first element.
  for server in "${CNY[@]}"
  do
    ./bin/set-server-state -f -z b -s 60 gstg maint "$server"
  done
  ```
- [ ] Fetch all canary/cny backends again to validate the state; they should all be in MAINT:

  ```shell
  ./bin/get-server-state -z b gstg | grep -E 'cny|canary'
  ```

- [ ] Document in the comment section of this CR that traffic is cut from CNY.
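A stricter check than eyeballing the list, sketched under the assumption that each output line includes the backend's state (e.g. MAINT):

```shell
# Any output here is a backend that has not reached MAINT yet; silence is success.
./bin/get-server-state -z b gstg | grep -E 'cny|canary' | grep -v 'MAINT'
```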
#### Remove traffic from cluster gstg-us-east1-b
- [ ] Set all main backends in zone b to `maint`:

  ```shell
  ./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }'

  declare -a MAIN=($(./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

  # Iterate over the whole array, not just its first element.
  for server in "${MAIN[@]}"
  do
    ./bin/set-server-state -f -z b -s 60 gstg maint "$server"
  done
  ```
- [ ] Fetch all main backends again to validate the state; they should all be in MAINT:

  ```shell
  ./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b'
  ```

- [ ] Document in the comment section of this CR that traffic is cut from cluster b.
#### Replace the cluster using Terraform
- [ ] Log in to Vault from a different terminal window: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/vault/usage.md#access
- [ ] Export the needed Vault configs:

  ```shell
  export VAULT_ADDR=https://vault.ops.gke.gitlab.net
  export VAULT_PROXY_ADDR=socks5://localhost:18200
  ```
- [ ] `git pull` the latest changes in the config-mgmt directory.
- [ ] Replace the cluster:

  ```shell
  tf apply -replace="module.gke-us-east1-b.google_container_cluster.cluster"
  ```
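Optionally, review the blast radius first; `-replace` also works with `plan` (a sketch, assuming `tf` wraps plain `terraform`):

```shell
# Show exactly what would be destroyed and recreated before committing to it.
tf plan -replace="module.gke-us-east1-b.google_container_cluster.cluster"
```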
- [ ] Run `glsh setup` to fetch the new cluster configs.
- [ ] Document in the comment section of this CR that the cluster is created and has no workloads.
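One way to confirm the "no workloads" state (a sketch; expect only kube-system and other GKE system pods on a fresh cluster):

```shell
glsh kube use-cluster gstg-us-east1-b
kubectl get pods --all-namespaces
```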
#### Deploy Workloads to the cluster
- [ ] Validate access to the cluster via kubectl: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/kube/k8s-oncall-setup.md#kubernetes-api-access
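A minimal sanity check for that access (sketch):

```shell
kubectl cluster-info        # the API server should answer
kubectl get nodes -o wide   # every node should report STATUS Ready
```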
- [ ] From the gitlab-secrets directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply` (optionally preview first; see the diff sketch after this list).
- [ ] From the gitlab-helmfiles directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply`
- [ ] From the tanka-deployments directory, pull the latest changes, then run `tk apply --name gstg-us-east1-b environments/fluentd-elasticsearch`
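The preview mentioned above, as a sketch (helmfile's `diff` subcommand relies on the helm-diff plugin being installed):

```shell
# Show what the apply would change without touching the cluster.
helmfile -f helmfile.yaml -e gstg-us-east1-b diff
```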
- [ ] Monitor the workloads for the other namespaces being created; for example, the monitoring namespace should have running workloads such as Prometheus.
- [ ] From the gitlab-com directory, pull the latest changes, then run `helmfile -f helmfile.yaml -e gstg-us-east1-b apply`
- [ ] Monitor the workloads in the gitlab namespace being created; we should see GitLab's components running (a watch sketch follows this list).
- [ ] Check the dashboard and make sure workloads are deployed: #7598 (closed)
- [ ] Check the workloads in the GCP console; all should be green and working.
- [ ] Document in the comment section of this CR that the cluster now has workloads.
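To follow the rollout without polling by hand (sketch):

```shell
# Watch GitLab components come up; interrupt with Ctrl-C once everything is Running.
kubectl get pods -n gitlab -w
```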
#### Make sure the cluster is up to date with the latest changes on the other cluster
- [ ] `glsh kube use-cluster gstg-us-east1-b`
- [ ] In a separate window: `kubectl get pod PODNAME -n gitlab -o jsonpath="{.spec.containers[0].image}"; echo`
- [ ] `glsh kube use-cluster gstg-us-east1-c`
- [ ] In a separate window: `kubectl get pod PODNAME -n gitlab -o jsonpath="{.spec.containers[0].image}"; echo`
- [ ] Verify that the version from cluster c matches the version from cluster b (see the comparison sketch below).
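If spot-checking a single pod per cluster feels error-prone, here is a broader comparison, sketched on the assumptions that the pods of interest live in the gitlab namespace and that `glsh kube use-cluster` switches the current kubectl context in place:

```shell
# Dump the first-container image of every pod in the gitlab namespace, per
# cluster, then diff the two sets. No diff output means the versions match.
for cluster in gstg-us-east1-b gstg-us-east1-c; do
  glsh kube use-cluster "$cluster"
  kubectl get pods -n gitlab \
    -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}' \
    | sort -u > "/tmp/images-${cluster}.txt"
done
diff /tmp/images-gstg-us-east1-b.txt /tmp/images-gstg-us-east1-c.txt
```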
#### Add traffic back to the cluster

We start with the main backends of gstg-b:
```shell
declare -a MAIN=($(./bin/get-server-state -z b gstg | grep -i -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

# Iterate over the whole array, not just its first element.
for server in "${MAIN[@]}"
do
  ./bin/set-server-state -f -z b -s 60 gstg ready "$server"
done
```

- [ ] Document in the comment section of this CR that the cluster is now accepting traffic.
Then cny:

```shell
declare -a CNY=($(./bin/get-server-state -z b gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '))

# Iterate over the whole array, not just its first element.
for server in "${CNY[@]}"
do
  ./bin/set-server-state -f -z b -s 60 gstg ready "$server"
done
```

- [ ] Document in the comment section of this CR that CNY is now accepting traffic.
- [ ] Remove the `CLUSTER_SKIP` variable with value `gstg-us-east1-b`. This needs to be removed from our ops instance, where the pipelines run: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/settings/ci_cd
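Mirroring the earlier API sketch (again hypothetical; `OPS_TOKEN` is an assumed token with Maintainer access):

```shell
# Sketch: delete the CI/CD variable via the GitLab API on the ops instance.
curl --request DELETE --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com/variables/CLUSTER_SKIP"
```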
- [ ] Set label ~change::complete: `/label ~change::complete`
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

There are no dedicated rollback steps for this change: since the cluster is being replaced in place, rolling back amounts to re-running the whole CR.
## Monitoring

### Key metrics to observe

- In this dashboard:
  - we should see the number of pods and containers drop to 0 when we destroy the cluster;
  - the numbers should climb back up when we reconstruct it.
- We should also monitor the namespaces: `kubectl get pods --all-namespaces`
- Troubleshoot any workloads that fail to run (a filter sketch follows this list).
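A quick filter for that troubleshooting step (sketch; Completed pods left behind by Jobs are expected and harmless):

```shell
# Surface anything unhealthy; empty output means all pods are Running or Completed.
kubectl get pods --all-namespaces --no-headers | grep -vE 'Running|Completed'
```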
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.