
Increase the node pool size of the production GKE cluster

Production Change - Criticality 2 (C2)

Change Objective: In order to add more workloads to the production GKE cluster, we are going to increase the node pool size
Change Type: C2
Services Impacted: Registry
Change Team Members: @jarv @skarbek
Change Severity: ~S3
Buddy check: @skarbek
Tested in staging: We have tested the same procedure on staging and confirmed there is no interruption of service
Schedule of the change: 2019-09-26 12:00 UTC
Duration of the change: 30 minutes
Detailed steps for the change: Each step must include pre-conditions for execution, execution commands for the step, post-execution validation, and rollback of the step

Summary

This change is necessary so we can add more workloads to the production GKE cluster. The following procedure has already been tested on preprod; we will run it on staging and then in production.

Pods are evicted and workloads are moved from the old node pool to the new node pool. No service interruption is expected during this transition.

Monitoring

During the change we will be continuously testing the registry service both by manually pushing and pulling images and by monitoring the following dashboards:

Manual testing during the procedure

While we test in staging and production we will continuously pull a registry image.

Staging

while true; do  \
  echo "-----" | ts; \
  docker pull registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts; \
  docker image rm registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts; \
done

Production

while true; do  \
  echo "-----" | ts; \
  docker pull registry.gitlab.com/jarv/registry-test 2>&1 | ts; \
  docker image rm registry.gitlab.com/jarv/registry-test 2>&1 | ts; \
done

Procedure

  • List the current nodes and note which belong to node-pool-0
kubectl get nodes -o wide
  • Cordon node-pool-0
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do \
  kubectl cordon "$node"; \
  read -p "Node $node cordoned, enter to continue ..."; \
done
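As a quick post-step validation (a sketch; assumes the default `kubectl get nodes` output, where cordoned nodes show `SchedulingDisabled` in the STATUS column):

```shell
# All node-pool-0 nodes should now report Ready,SchedulingDisabled
kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0
```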
  • Run the following in a terminal window to monitor the pods
watch kubectl get pods --all-namespaces -o wide
  • Evict pods from node-pool-0
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do \
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"; \
  read -p "Node $node drained, enter to continue ..."; \
done
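To verify the drain before moving on, one option is to list what is still scheduled on each old node (a sketch; `--field-selector spec.nodeName=...` filters pods by node, and the `${node#node/}` expansion strips the `node/` prefix that `-o=name` adds):

```shell
# Only DaemonSet-managed pods should remain on the drained nodes
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
  kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName="${node#node/}"
done
```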
  • Confirm with dashboards and manual testing that the registry service is healthy

Cleanup

Deleting the old pool requires some manual tasks to ensure clean terraform runs.

  • Delete node-pool-0 manually in the console (screenshot: Screen_Shot_2019-09-26_at_12.28.34_PM)

  • Delete the node pools from terraform state

tf state rm module.gitlab-gke.google_container_node_pool.node_pool[0]
tf state rm module.gitlab-gke.google_container_node_pool.node_pool[1]
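Before importing, it may be worth confirming the removals took effect (a sketch; assumes `tf` wraps `terraform`, as it does elsewhere in this issue):

```shell
# Should print nothing if both node pool resources were removed from state
tf state list | grep google_container_node_pool
```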
  • Import the node pool into terraform state
tf import module.gitlab-gke.google_container_node_pool.node_pool[0] gitlab-production/us-east1/gprd-gitlab-gke/node-pool-20190927
  • Remove node-pool-0 from the terraform configuration and confirm a clean plan
tf plan -target module.gitlab-gke
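If a stricter check is wanted, `terraform plan` supports `-detailed-exitcode` (a sketch, again assuming `tf` wraps `terraform`):

```shell
# Exit code 0: no changes pending; 2: plan still wants to make changes; 1: error
tf plan -detailed-exitcode -target module.gitlab-gke
```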

Rollback

  • To allow workloads to be scheduled on the old node pool:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do kubectl uncordon "$node"; done
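To validate the rollback (a sketch; uncordoned nodes should no longer show `SchedulingDisabled` in the STATUS column):

```shell
# Nodes should report Ready with no SchedulingDisabled suffix
kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0
```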