# Increase the node pool size of the production GKE cluster

Production Change - Criticality 2 (C2)
| Change Objective | Increase the node pool size so that we can add more workloads to the production GKE cluster |
|---|---|
| Change Type | C2 |
| Services Impacted | Registry |
| Change Team Members | @jarv @skarbek |
| Change Severity | ~S3 |
| Buddy check | @skarbek |
| Tested in staging | We have tested the same procedure on staging and confirmed that there is no interruption of service |
| Schedule of the change | 2019-09-26 12:00 UTC |
| Duration of the change | 30 minutes |
| Detailed steps for the change | Each step must include: pre-conditions for execution, execution commands, post-execution validation, and rollback of the step |
## Summary
This change is necessary so that we can add more workloads to the production GKE cluster. The procedure has already been tested on preprod; we will run it on staging and then on production.

Pods are evicted and workloads are moved from the old node pool to the new node pool. No service interruption is expected during this transition.
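Because `kubectl drain` uses the eviction API, PodDisruptionBudgets are honored during the move. A minimal optional pre-check (not part of the tested procedure) to see how many pods each workload allows to be evicted at once:

```shell
# Optional pre-check: list PodDisruptionBudgets across all namespaces so we
# know how constrained the drain will be
kubectl get pdb --all-namespaces
```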
## Monitoring
During the change we will continuously test the registry service, both by manually pushing and pulling images and by monitoring the following dashboards:
- Logs: https://log.gitlab.net/goto/43fb211edc754b58d4e22f3142eaeb8f
- Application Metrics: https://dashboards.gitlab.net/d/CoBSgj8iz/application-info
- Pod Metrics: https://dashboards.gitlab.net/d/oWe9aYxmk/pod-metrics
- Registry service error ratios: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now&fullscreen&panelId=8
- General service metrics, registry: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now
## Manual testing during the procedure

While we test in staging and production, we will continuously pull a registry image in a loop (`ts`, from moreutils, timestamps each line of output).
### Staging

```shell
while true; do
  echo "-----" | ts
  docker pull registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts
  docker image rm registry.staging.gitlab.com/jarv/registry-test 2>&1 | ts
done
```
### Production

```shell
while true; do
  echo "-----" | ts
  docker pull registry.gitlab.com/jarv/registry-test 2>&1 | ts
  docker image rm registry.gitlab.com/jarv/registry-test 2>&1 | ts
done
```
## Procedure
- Switch to the appropriate kubectx (staging or production).
- Add a new node pool `node-pool-1` in terraform that has the bigger instance size.
- Confirm there are new nodes in the pool:

  ```shell
  kubectl get nodes -o wide
  ```
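  As an additional check, a label selector can list only the new pool's nodes. This is a sketch that assumes the new pool is named `node-pool-20190927`, matching the terraform import in the Cleanup section below:

  ```shell
  # Only the new pool's nodes should be listed, all with STATUS=Ready
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-20190927 -o wide
  ```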
- Cordon `node-pool-0`:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl cordon "$node"
    read -p "Node $node cordoned, enter to continue ..."
  done
  ```
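  To validate the cordon, every node in the old pool should now report `SchedulingDisabled` (a minimal check, reusing the selector above):

  ```shell
  # STATUS should read "Ready,SchedulingDisabled" for each old-pool node
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0
  ```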
- Run the following in a separate terminal window to monitor the pods:

  ```shell
  watch kubectl get pods --all-namespaces -o wide
  ```
- Evict pods from `node-pool-0`:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"
    read -p "Node $node drained, enter to continue ..."
  done
  ```
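  As post-drain validation, only DaemonSet-managed pods should remain on the old pool. A sketch (`${node#node/}` strips the `node/` prefix that `-o=name` adds):

  ```shell
  # List any pods still scheduled on the old pool's nodes
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName="${node#node/}"
  done
  ```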
- Confirm with the dashboards and manual testing that the registry service is healthy.
## Cleanup

Deleting the old pool requires some manual state manipulation to ensure clean terraform runs.
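Before editing state, it may be worth taking a snapshot so the surgery is reversible. A sketch, assuming `tf` wraps the `terraform` CLI:

```shell
# Back up the current state before any state rm / import operations
tf state pull > "state-backup-$(date +%Y%m%d-%H%M%S).json"
```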
- Remove both node pool resources from the terraform state (quoted so the shell does not glob the index brackets):

  ```shell
  tf state rm 'module.gitlab-gke.google_container_node_pool.node_pool[0]'
  tf state rm 'module.gitlab-gke.google_container_node_pool.node_pool[1]'
  ```
- Import the new node pool into the terraform state:

  ```shell
  tf import 'module.gitlab-gke.google_container_node_pool.node_pool[0]' gitlab-production/us-east1/gprd-gitlab-gke/node-pool-20190927
  ```
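  To verify the import, the resource should now appear in the state listing (a minimal check):

  ```shell
  # The imported pool should show up under module.gitlab-gke
  tf state list | grep google_container_node_pool
  ```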
- Remove `node-pool-0` from the terraform configuration, and confirm a clean plan (it should report no pending changes for the module):

  ```shell
  tf plan -target module.gitlab-gke
  ```
## Rollback
- To allow workloads to be scheduled on the old node pool again, uncordon its nodes:

  ```shell
  for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 -o=name); do
    kubectl uncordon "$node"
  done
  ```
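  A quick check that no old-pool node is still marked unschedulable (a sketch using kubectl's custom-columns output):

  ```shell
  # UNSCHEDULABLE should be empty or <none> for every node
  kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-0 \
    -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
  ```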
