# Rebuild Staging Zonal Cluster us-east1-c

## Production Change

### Change Summary
Delivery ~"team::System" is looking to complete a cluster rebuild as part of our OKR for Q4. Our goal is to evaluate how well our newly built runbook works, observe its painful areas, and file issues to address those problems and make this procedure easier in the future. The steps in this procedure are slight modifications of the runbook located here: https://gitlab.com/gitlab-com/runbooks/-/blob/b99a2955507481f8fc03abf1637d5a49df684403/docs/kube/k8s-cluster-rebuild.md
### Change Details

- **Services Impacted** - All services
- **Change Technician** - @skarbek
- **Change Reviewer** - @ahyield
- **Time tracking** - 6 hours
- **Downtime Component** - none
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 6 hours

- [ ] Set label ~"change::in-progress": /label ~change::in-progress

### 1. Skipping cluster deployment
- [ ] Make sure the Auto Deploy pipeline is not active and no deployment is currently running against the environment.
- [ ] Identify the name of the cluster we need to skip; we need the full name of the GKE cluster: `gstg-us-east1-c`.
- [ ] Set the environment variable `CLUSTER_SKIP` to the name of the cluster, `gstg-us-east1-c`. This needs to be set on the ops instance where the pipelines run.
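The `CLUSTER_SKIP` update is normally done in the ops instance's CI/CD variable settings. As a rough sketch, the same change can be made through GitLab's CI/CD variables API; the project path below is a placeholder, not the real ops project, and `$OPS_TOKEN` is an assumed access token:

```shell
# Hypothetical placeholders; the real variable lives on whichever ops
# instance project runs the deployment pipelines.
OPS_PROJECT="gitlab-com%2Fsome%2Fproject"   # placeholder: URL-encoded project path
CLUSTER_SKIP_VALUE="gstg-us-east1-c"
URL="https://ops.gitlab.net/api/v4/projects/${OPS_PROJECT}/variables/CLUSTER_SKIP"
echo "$URL"
# With a token in $OPS_TOKEN, the update itself would be:
# curl --request PUT --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
#      --data "value=${CLUSTER_SKIP_VALUE}" "$URL"
```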
### 2. Removing traffic

- [ ] Create a silence on alerts.gitlab.net using the following example filter: `alertname=SnitchHeartBeat`, `cluster=gstg-us-east1-c`
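For reference, the same silence can also be created from the CLI with `amtool` instead of the alerts.gitlab.net UI. This is only a sketch: the duration, comment, and Alertmanager URL are assumptions, not values from this change plan.

```shell
# The same matchers as the silence filter in the step above.
MATCHERS="alertname=SnitchHeartBeat cluster=gstg-us-east1-c"
# Assumed duration/comment; amtool also needs --alertmanager.url to run for real.
SILENCE_CMD="amtool silence add ${MATCHERS} --duration=6h --comment='gstg-us-east1-c rebuild CR'"
echo "${SILENCE_CMD}"
# eval "${SILENCE_CMD} --alertmanager.url=<real Alertmanager URL>"
```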
#### 2.a Removing traffic from canary

We do this so we don't over-saturate canary when the `gstg` cluster goes down; canary doesn't have the same capacity as the main stage.
- [ ] We start with setting all the canary backends to `MAINT` mode:

```shell
$> declare -a CNY=(`./bin/get-server-state -z c gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $CNY
do
./bin/set-server-state -f -z c gstg maint $server
done
```
- [ ] Fetch all canary backends to validate that they are put to `MAINT`:

```shell
$> ./bin/get-server-state -z c gstg | grep -E 'cny|canary'
```
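The `awk` extraction used above can be sanity-checked offline. This sketch assumes `get-server-state` prints the server name as the third whitespace-separated field with a trailing comma; the sample lines are invented for illustration:

```shell
# Invented sample in the assumed format: <lb-host>: <backend>, <server>, <state>
sample='fe-01: canary_web, cny-web-01-sv, MAINT
fe-01: web, web-01-sv, READY'
# Same pipeline as the runbook step: keep canary lines, take field 3, strip the comma
names=$(printf '%s\n' "$sample" | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }')
echo "$names"   # prints: cny-web-01-sv
```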
#### 2.b Removing traffic from main stage

- [ ] We now want to remove traffic targeting our main stage for this zone. The command below instructs the HAProxy nodes that live in the same zone to set the backends to `MAINT`:

```shell
$> declare -a MAIN=(`./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c' | awk '{ print substr($3,1,length($3)-1) }'| tr '\n' ' '`)
$> for server in $MAIN
do
./bin/set-server-state -f -z c -s 60 gstg maint $server
done
```

- [ ] Fetch all main stage backends to validate that they are put to `MAINT`:

```shell
$> ./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c'
```
### 3. Replacing cluster using Terraform

#### 3.a Setting up the tooling

To work with Terraform and the `config-mgmt` repo, refer to the getting started documentation for setting up the needed tooling and a quick overview of the steps involved.
#### 3.b Pull latest changes

- [ ] Make sure you have pulled the latest changes from the `config-mgmt` repository before executing any command.
#### 3.c Executing terraform

- [ ] Perform a `plan` and validate the changes before running the `apply`:

```shell
tf plan -replace="module.gke-us-east1-c.google_container_cluster.cluster"
```

You should see:

```
Terraform will perform the following actions:

  # module.gke-us-east1-c.google_container_cluster.cluster will be replaced, as requested
```
- [ ] Then we `apply` the change:

```shell
tf apply -replace="module.gke-us-east1-c.google_container_cluster.cluster"
```
#### 3.d New cluster config setup

After the Terraform command has executed we will have a brand new cluster, and we need to orient our tooling to use it.
- [ ] We start with `glsh`: we need to run `glsh kube setup` to fetch the cluster configs.
- [ ] Validate we can use the new context and `kubectl` works with the cluster:

```shell
$> glsh kube setup
$> glsh kube use-cluster gstg-us-east1-c
$> kubectl get pods --all-namespaces
```
- [ ] Update the new cluster's `apiServer` IP in the tanka repository
- [ ] Configure the Vault secrets responsible for CI configurations within `config-mgmt`:
```shell
CONTEXT_NAME=$(kubectl config current-context)
# Note: double quotes so $CONTEXT_NAME expands inside the jsonpath expression
KUBERNETES_HOST=$(kubectl config view -o jsonpath="{.clusters[?(@.name == \"$CONTEXT_NAME\")].cluster.server}")
SA_SECRET=$(kubectl --namespace external-secrets get serviceaccount external-secrets-vault-auth -o jsonpath='{.secrets[0].name}')
SA_TOKEN=$(kubectl --namespace external-secrets get secret ${SA_SECRET} -o jsonpath='{.data.token}' | base64 -d)
CA_CERT=$(kubectl --namespace external-secrets get secret ${SA_SECRET} -o jsonpath='{.data.ca\.crt}' | base64 -d)
vault kv put ci/ops-gitlab-net/config-mgmt/vault-production/kubernetes/us-east1-c host="${KUBERNETES_HOST}" ca_cert="${CA_CERT}" token="${SA_TOKEN}"
```
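Before the `vault kv put`, it can be worth failing fast if any of the collected values came back empty (for example, if the service-account secret was not found). A minimal sketch, with placeholder values standing in for the `kubectl` output above:

```shell
# Placeholder values; in the real run these come from the kubectl commands above.
KUBERNETES_HOST="https://203.0.113.10"
SA_TOKEN="placeholder-token"
CA_CERT="placeholder-cert"

for v in KUBERNETES_HOST SA_TOKEN CA_CERT; do
  eval "val=\$$v"   # portable indirect lookup of the variable named by $v
  if [ -z "$val" ]; then
    echo "ERROR: $v is empty; aborting before vault kv put" >&2
    exit 1
  fi
done
echo "vault inputs look sane"
```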
- [ ] From the `config-mgmt/environments/vault-production` directory, run `tf apply`. It will show that there's a config change for the cluster we replaced; apply this change so Vault knows about the new cluster.

```shell
cd config-mgmt/environments/vault-production
tf apply
```
### 4.a Deploying Workloads

- [ ] First we bootstrap our cluster with the required CI configurations. From the gitlab-helmfiles repo (after pulling the latest changes):

```shell
cd releases/00-gitlab-ci-accounts
helmfile -e gstg-us-east1-c apply
```
Then we can complete setup via our existing CI pipelines:

- [ ] From gitlab-helmfiles CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] From gitlab-secrets CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] From tanka-deployments CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] After installing the workloads, run `kubectl get pods --all-namespaces` and check that all workloads are working correctly before going to the next step.
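Spot-checking every pod by eye is error-prone, so the check can be scripted. This sketch runs the same kind of status filter against an invented two-line sample in place of real `kubectl get pods --all-namespaces --no-headers` output:

```shell
# Invented sample of `kubectl get pods --all-namespaces --no-headers` output:
# NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
sample='gitlab      gitlab-webservice-abc    2/2   Running            0   5m
gitlab      gitlab-sidekiq-def       0/1   CrashLoopBackOff   3   5m'
# Print any pod whose STATUS column is not Running or Completed
bad=$(printf '%s\n' "$sample" | awk '$4 != "Running" && $4 != "Completed" { print $2 }')
echo "${bad:-all pods healthy}"   # prints: gitlab-sidekiq-def
```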
#### 4.b Deploy Prometheus Rules

- [ ] Browse to Runbooks CI Pipelines
- [ ] Find the latest pipeline executed against the default branch
- [ ] Retry the `deploy-rules-non-production` job
#### 4.c Deploying gitlab-com

- [ ] Remove the `CLUSTER_SKIP` variable from the ops instance
- [ ] To deploy gitlab-com, we simply need to re-run the latest successful `auto-deploy` job. Go to the announcements channel, check the latest successful job, and re-run the Kubernetes job for the targeted cluster.
- [ ] Connect to our replaced cluster and spot check it to validate the Pods are coming online and remain in a Running state:

```shell
glsh kube use-cluster gstg-us-east1-c
kubectl get pods --namespace gitlab
```
#### 4.d Verify we run the same version on all clusters

- [ ] `glsh kube use-cluster gstg-us-east1-b`
- [ ] In a separate window: `kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"`
- [ ] `glsh kube use-cluster gstg-us-east1-c`
- [ ] In a separate window: `kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"`
- [ ] Confirm the version from cluster `c` matches the version from cluster `b`
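The comparison in the last step can be scripted rather than eyeballed. `VERSION_B` and `VERSION_C` below are placeholders for the two `kubectl get configmap` outputs from the steps above:

```shell
# Placeholders for the gitlabVersion values read from clusters b and c above.
VERSION_B="16.5.2"
VERSION_C="16.5.2"

if [ "$VERSION_B" = "$VERSION_C" ]; then
  echo "versions match: $VERSION_B"
else
  echo "version mismatch: b=$VERSION_B c=$VERSION_C" >&2
  exit 1
fi
```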
#### 4.e Monitoring

- [ ] In this dashboard we should see the numbers of pods and containers of the cluster.
- [ ] Remove any silences that were created earlier
- [ ] Validate no alerts related to this replacement cluster are firing on the Alertmanager
### 5. Pushing traffic back to the cluster

- [ ] We start with the main stage backends for this zone:
```shell
$> declare -a MAIN=(`./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $MAIN
do
./bin/set-server-state -f -z c -s 60 gstg ready $server
done
```
- [ ] Validate all main stage backends are in the READY state:

```shell
$> ./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c'
```
- [ ] Then we change the canary backends to the READY state:

```shell
$> declare -a CNY=(`./bin/get-server-state -z c gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $CNY
do
./bin/set-server-state -f -z c -s 60 gstg ready $server
done
```
- [ ] Validate all canary stage backends are in the READY state:

```shell
$> ./bin/get-server-state -z c gstg | grep -E 'cny|canary'
```
- [ ] Set label ~"change::complete": /label ~change::complete
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

## Monitoring

### Key metrics to observe

During the procedure, these key metrics should be observed and reported on throughout the entirety of this Change Request.
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - Release managers have been informed, if needed (cases include DB changes), prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.