# Rebuild Staging Zonal Cluster us-east1-c

## Production Change

### Change Summary
Delivery ~"team::System" is looking to complete a cluster rebuild as part of our OKR for Q4. Our goal is to evaluate how well our newly built runbook works, observe its painful areas, and file issues to address those problems and make this procedure easier in the future. The steps in this procedure are slight modifications of the runbook located here: https://gitlab.com/gitlab-com/runbooks/-/blob/b99a2955507481f8fc03abf1637d5a49df684403/docs/kube/k8s-cluster-rebuild.md
### Change Details

- **Services Impacted** - All services
- **Change Technician** - @skarbek
- **Change Reviewer** - @ahyield
- **Time tracking** - 6 hours
- **Downtime Component** - none
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 6 hours

- [ ] Set label ~"change::in-progress": /label ~change::in-progress

### 1. Skipping cluster deployment
- [ ] Make sure the Auto Deploy pipeline is not active and no deployment is currently running against the environment.
- [ ] Identify the name of the cluster we need to skip; we need the full name of the GKE cluster: `gstg-us-east1-c`.
- [ ] Set the environment variable `CLUSTER_SKIP` to the name of the cluster, `gstg-us-east1-c`. This needs to be set on the ops instance where the pipelines run.
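The `CLUSTER_SKIP` update is normally done in the ops instance's CI/CD variable settings. As a rough sketch, the same change can be made through GitLab's CI/CD variables API; the project path below is a placeholder, not the real ops project, and `$OPS_TOKEN` is an assumed access token:

```shell
# Hypothetical placeholders; the real variable lives on whichever ops
# instance project runs the deployment pipelines.
OPS_PROJECT="gitlab-com%2Fsome%2Fproject"   # placeholder: URL-encoded project path
CLUSTER_SKIP_VALUE="gstg-us-east1-c"
URL="https://ops.gitlab.net/api/v4/projects/${OPS_PROJECT}/variables/CLUSTER_SKIP"
echo "$URL"
# With a token in $OPS_TOKEN, the update itself would be:
# curl --request PUT --header "PRIVATE-TOKEN: ${OPS_TOKEN}" \
#      --data "value=${CLUSTER_SKIP_VALUE}" "$URL"
```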
### 2. Removing traffic

- [ ] Create a silence on alerts.gitlab.net using the following example filter: `alertname=SnitchHeartBeat`, `cluster=gstg-us-east1-c`
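For reference, the same silence can also be created from the CLI with `amtool` instead of the alerts.gitlab.net UI. This is only a sketch: the duration, comment, and Alertmanager URL are assumptions, not values from this change plan.

```shell
# The same matchers as the silence filter in the step above.
MATCHERS="alertname=SnitchHeartBeat cluster=gstg-us-east1-c"
# Assumed duration/comment; amtool also needs --alertmanager.url to run for real.
SILENCE_CMD="amtool silence add ${MATCHERS} --duration=6h --comment='gstg-us-east1-c rebuild CR'"
echo "${SILENCE_CMD}"
# eval "${SILENCE_CMD} --alertmanager.url=<real Alertmanager URL>"
```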
#### 2.a Removing traffic from canary

We do this so we don't over-saturate canary when the `gstg` cluster goes down; canary doesn't have the same capacity as the main stage.
- [ ] We start with setting all the canary backends to `MAINT` mode:

```shell
$> declare -a CNY=(`./bin/get-server-state -z c gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $CNY
do
./bin/set-server-state -f -z c gstg maint $server
done
```
- [ ] Fetch all canary backends to validate that they are put to `MAINT`:

```shell
$> ./bin/get-server-state -z c gstg | grep -E 'cny|canary'
```
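The `awk` extraction used above can be sanity-checked offline. This sketch assumes `get-server-state` prints the server name as the third whitespace-separated field with a trailing comma; the sample lines are invented for illustration:

```shell
# Invented sample in the assumed format: <lb-host>: <backend>, <server>, <state>
sample='fe-01: canary_web, cny-web-01-sv, MAINT
fe-01: web, web-01-sv, READY'
# Same pipeline as the runbook step: keep canary lines, take field 3, strip the comma
names=$(printf '%s\n' "$sample" | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }')
echo "$names"   # prints: cny-web-01-sv
```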
#### 2.b Removing traffic from main stage

- [ ] We now want to remove traffic targeting our main stage for this zone. The command below instructs the HAProxy nodes that live in the same zone to set the backends to `MAINT`:

```shell
$> declare -a MAIN=(`./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c' | awk '{ print substr($3,1,length($3)-1) }'| tr '\n' ' '`)
$> for server in $MAIN
do
./bin/set-server-state -f -z c -s 60 gstg maint $server
done
```

- [ ] Fetch all main stage backends to validate that they are put to `MAINT`:

```shell
$> ./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c'
```
### 3. Replacing cluster using Terraform

#### 3.a Setting up the tooling

To work with Terraform and the `config-mgmt` repo, refer to the getting started documentation for setting up the needed tooling and a quick overview of the steps involved.
#### 3.b Pull latest changes

- [ ] Make sure you have pulled the latest changes from the `config-mgmt` repository before executing any command.
#### 3.c Executing terraform

- [ ] Perform a `plan` and validate the changes before running the `apply`:

```shell
tf plan -replace="module.gke-us-east1-c.google_container_cluster.cluster"
```

You should see:

```
Terraform will perform the following actions:

  # module.gke-us-east1-c.google_container_cluster.cluster will be replaced, as requested
```
- [ ] Then we `apply` the change:

```shell
tf apply -replace="module.gke-us-east1-c.google_container_cluster.cluster"
```
#### 3.d New cluster config setup

After the Terraform command has executed we will have a brand new cluster, and we need to orient our tooling to use it.
- [ ] We start with `glsh`: we need to run `glsh kube setup` to fetch the cluster configs.
- [ ] Validate we can use the new context and `kubectl` works with the cluster:

```shell
$> glsh kube setup
$> glsh kube use-cluster gstg-us-east1-c
$> kubectl get pods --all-namespaces
```
- [ ] Update the new cluster's `apiServer` IP in the tanka repository
- [ ] Configure the Vault secrets responsible for CI configurations within `config-mgmt`:
```shell
CONTEXT_NAME=$(kubectl config current-context)
# Note: double quotes so $CONTEXT_NAME expands inside the jsonpath expression
KUBERNETES_HOST=$(kubectl config view -o jsonpath="{.clusters[?(@.name == \"$CONTEXT_NAME\")].cluster.server}")
SA_SECRET=$(kubectl --namespace external-secrets get serviceaccount external-secrets-vault-auth -o jsonpath='{.secrets[0].name}')
SA_TOKEN=$(kubectl --namespace external-secrets get secret ${SA_SECRET} -o jsonpath='{.data.token}' | base64 -d)
CA_CERT=$(kubectl --namespace external-secrets get secret ${SA_SECRET} -o jsonpath='{.data.ca\.crt}' | base64 -d)
vault kv put ci/ops-gitlab-net/config-mgmt/vault-production/kubernetes/us-east1-c host="${KUBERNETES_HOST}" ca_cert="${CA_CERT}" token="${SA_TOKEN}"
```
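Before the `vault kv put`, it can be worth failing fast if any of the collected values came back empty (for example, if the service-account secret was not found). A minimal sketch, with placeholder values standing in for the `kubectl` output above:

```shell
# Placeholder values; in the real run these come from the kubectl commands above.
KUBERNETES_HOST="https://203.0.113.10"
SA_TOKEN="placeholder-token"
CA_CERT="placeholder-cert"

for v in KUBERNETES_HOST SA_TOKEN CA_CERT; do
  eval "val=\$$v"   # portable indirect lookup of the variable named by $v
  if [ -z "$val" ]; then
    echo "ERROR: $v is empty; aborting before vault kv put" >&2
    exit 1
  fi
done
echo "vault inputs look sane"
```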
- [ ] From the `config-mgmt/environments/vault-production` directory, run `tf apply`. It will show that there's a config change for the cluster we replaced; apply this change so Vault knows about the new cluster.

```shell
cd config-mgmt/environments/vault-production
tf apply
```
### 4.a Deploying Workloads

- [ ] First we bootstrap our cluster with the required CI configurations. From the gitlab-helmfiles repo (after pulling the latest changes):

```shell
cd releases/00-gitlab-ci-accounts
helmfile -e gstg-us-east1-c apply
```
Then we can complete setup via our existing CI pipelines:

- [ ] From gitlab-helmfiles CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] From gitlab-secrets CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] From tanka-deployments CI Pipelines, find the latest default branch pipeline and rerun the job associated with the cluster rebuild
- [ ] After installing the workloads, run `kubectl get pods --all-namespaces` and check that all workloads are working correctly before going to the next step.
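Spot-checking every pod by eye is error-prone, so the check can be scripted. This sketch runs the same kind of status filter against an invented two-line sample in place of real `kubectl get pods --all-namespaces --no-headers` output:

```shell
# Invented sample of `kubectl get pods --all-namespaces --no-headers` output:
# NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
sample='gitlab      gitlab-webservice-abc    2/2   Running            0   5m
gitlab      gitlab-sidekiq-def       0/1   CrashLoopBackOff   3   5m'
# Print any pod whose STATUS column is not Running or Completed
bad=$(printf '%s\n' "$sample" | awk '$4 != "Running" && $4 != "Completed" { print $2 }')
echo "${bad:-all pods healthy}"   # prints: gitlab-sidekiq-def
```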
#### 4.b Deploy Prometheus Rules

- [ ] Browse to Runbooks CI Pipelines
- [ ] Find the latest pipeline executed against the default branch
- [ ] Retry the `deploy-rules-non-production` job
#### 4.c Deploying gitlab-com

- [ ] Remove the `CLUSTER_SKIP` variable from the ops instance
- [ ] To deploy gitlab-com, we simply need to re-run the latest successful `auto-deploy` job. Go to the announcements channel, check the latest successful job, and re-run the Kubernetes job for the targeted cluster.
- [ ] Connect to our replaced cluster and spot check it to validate the Pods are coming online and remain in a Running state:

```shell
glsh kube use-cluster gstg-us-east1-c
kubectl get pods --namespace gitlab
```
#### 4.d Verify we run the same version on all clusters

- [ ] `glsh kube use-cluster gstg-us-east1-b`
- [ ] In a separate window: `kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"`
- [ ] `glsh kube use-cluster gstg-us-east1-c`
- [ ] In a separate window: `kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"`
- [ ] Confirm the version from cluster `c` matches the version from cluster `b`
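The comparison in the last step can be scripted rather than eyeballed. `VERSION_B` and `VERSION_C` below are placeholders for the two `kubectl get configmap` outputs from the steps above:

```shell
# Placeholders for the gitlabVersion values read from clusters b and c above.
VERSION_B="16.5.2"
VERSION_C="16.5.2"

if [ "$VERSION_B" = "$VERSION_C" ]; then
  echo "versions match: $VERSION_B"
else
  echo "version mismatch: b=$VERSION_B c=$VERSION_C" >&2
  exit 1
fi
```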
#### 4.e Monitoring

- [ ] In this dashboard we should see the numbers of pods and containers of the cluster.
- [ ] Remove any silences that were created earlier
- [ ] Validate no alerts related to this replacement cluster are firing on the Alertmanager
### 5. Pushing traffic back to the cluster

- [ ] We start with the main stage backends for this zone:
```shell
$> declare -a MAIN=(`./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $MAIN
do
./bin/set-server-state -f -z c -s 60 gstg ready $server
done
```
- [ ] Validate all main stage backends are in the READY state:

```shell
$> ./bin/get-server-state -z c gstg | grep -I -v -E 'cny|canary'| grep 'us-east1-c'
```
- [ ] Then we change the canary backends to the READY state:

```shell
$> declare -a CNY=(`./bin/get-server-state -z c gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in $CNY
do
./bin/set-server-state -f -z c -s 60 gstg ready $server
done
```
- [ ] Validate all canary stage backends are in the READY state:

```shell
$> ./bin/get-server-state -z c gstg | grep -E 'cny|canary'
```
- [ ] Set label ~"change::complete": /label ~change::complete
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

## Monitoring

### Key metrics to observe

During the procedure, these key metrics should be observed and reported on throughout the entirety of this Change Request.
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - Release managers have been informed, if needed (cases include DB changes), prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.