# Increase the node pool size of the production GKE cluster

Production Change - Criticality 2 (C2)
| Change Objective | Increase the node pool size so that we can add more workloads to the production GKE cluster |
|---|---|
| Change Type | C2 |
| Services Impacted | Registry |
| Change Team Members | @jarv @skarbek |
| Change Severity | ~S3 |
| Buddy check | @skarbek |
| Tested in staging | We have tested the same procedure on staging and confirmed that there is no interruption of service |
| Schedule of the change | 2019-09-26 12:00 UTC |
| Duration of the change | 30 minutes |
| Detailed steps for the change | Each step must include: pre-conditions for execution, execution commands, post-execution validation, and rollback of the step |
## Summary
This change is necessary so that we can add more workloads to the production GKE cluster. The procedure has already been tested on preprod; we will run it on staging and then on production.

Pods are evicted and workloads are moved from the old node pool to the new node pool. No service interruption is expected during this transition.
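Because `kubectl drain` uses the eviction API, PodDisruptionBudgets are honored during the move. A minimal optional pre-check (not part of the tested procedure) to see how many pods each workload allows to be evicted at once:

```shell
# Optional pre-check: list PodDisruptionBudgets across all namespaces so we
# know how constrained the drain will be
kubectl get pdb --all-namespaces
```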
## Monitoring
During the change we will continuously test the registry service, both by manually pushing and pulling images and by monitoring the following dashboards:
- Logs: https://log.gitlab.net/goto/43fb211edc754b58d4e22f3142eaeb8f
- Application Metrics: https://dashboards.gitlab.net/d/CoBSgj8iz/application-info
- Pod Metrics: https://dashboards.gitlab.net/d/oWe9aYxmk/pod-metrics
- Registry service error ratios: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now&fullscreen&panelId=8
- General service metrics, registry: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now
## Manual testing during the procedure

While we test in staging and production, we will continuously pull a registry image in a loop (`ts`, from moreutils, timestamps each line of output).
### Staging

```shell
while true; do
  echo "-----" | ts
  docker pull registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts
  docker image rm registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts
done
```
### Production

```shell
while true; do
  echo "-----" | ts
  docker pull registry.gitlab.com/jarv/registry-test 2>&1 | ts
  docker image rm registry.gitlab.com/jarv/registry-test 2>&1 | ts
done
```
## Procedure
- Switch to the appropriate kubectx (staging or production).
- Add a new node pool `node-pool-1` in terraform that has the bigger instance size.
- Confirm there are new nodes in the pool:

  ```shell
  kubectl get nodes -o wide
  ```
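  As an additional check, a label selector can list only the new pool's nodes. This is a sketch that assumes the new pool is named `node-pool-20190927`, matching the terraform import in the Cleanup section below:

  ```shell
  # Only the new pool's nodes should be listed, all with STATUS=Ready
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-20190927 -o wide
  ```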
- Cordon `node-pool-0`:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl cordon "$node"
    read -p "Node $node cordoned, enter to continue ..."
  done
  ```
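  To validate the cordon, every node in the old pool should now report `SchedulingDisabled` (a minimal check, reusing the selector above):

  ```shell
  # STATUS should read "Ready,SchedulingDisabled" for each old-pool node
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0
  ```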
- Run the following in a separate terminal window to monitor the pods:

  ```shell
  watch kubectl get pods --all-namespaces -o wide
  ```
- Evict pods from `node-pool-0`:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"
    read -p "Node $node drained, enter to continue ..."
  done
  ```
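  As post-drain validation, only DaemonSet-managed pods should remain on the old pool. A sketch (`${node#node/}` strips the `node/` prefix that `-o=name` adds):

  ```shell
  # List any pods still scheduled on the old pool's nodes
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName="${node#node/}"
  done
  ```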
- Confirm with the dashboards and manual testing that the registry service is healthy.
## Cleanup

Deleting the old pool requires some manual state manipulation to ensure clean terraform runs.
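Before editing state, it may be worth taking a snapshot so the surgery is reversible. A sketch, assuming `tf` wraps the `terraform` CLI:

```shell
# Back up the current state before any state rm / import operations
tf state pull > "state-backup-$(date +%Y%m%d-%H%M%S).json"
```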
- Remove both node pool resources from the terraform state (quoted so the shell does not glob the index brackets):

  ```shell
  tf state rm 'module.gitlab-gke.google_container_node_pool.node_pool[0]'
  tf state rm 'module.gitlab-gke.google_container_node_pool.node_pool[1]'
  ```
- Import the new node pool into the terraform state:

  ```shell
  tf import 'module.gitlab-gke.google_container_node_pool.node_pool[0]' gitlab-production/us-east1/gprd-gitlab-gke/node-pool-20190927
  ```
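  To verify the import, the resource should now appear in the state listing (a minimal check):

  ```shell
  # The imported pool should show up under module.gitlab-gke
  tf state list | grep google_container_node_pool
  ```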
- Remove `node-pool-0` from the terraform configuration, and confirm a clean plan (it should report no pending changes for the module):

  ```shell
  tf plan -target module.gitlab-gke
  ```
## Rollback
- To allow workloads to be scheduled on the old node pool again, uncordon its nodes:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl uncordon "$node"
  done
  ```
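  A quick check that no old-pool node is still marked unschedulable (a sketch using kubectl's custom-columns output):

  ```shell
  # UNSCHEDULABLE should be empty or <none> for every node
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 \
    -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
  ```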
