Transition Container Registry Running Virtual Machines to Kubernetes Pods

Production Change - Criticality 4 C4

Swap Container Registry VMs to Kubernetes Pods	Slowly transition from running the Container Registry on VMs to running it in Pods on GKE
Change Type	Change
Services Impacted	Container Registry
Change Team Members	@jarv @skarbek
Change Severity	C4
Buddy check or tested in staging	Any interested parties?
Schedule of the change	2019-08-20 through TBD
Duration of the change	Each step takes only seconds to apply, followed by a period of monitoring and, if needed, a rollback. The transition period has no fixed end date, because other documents need to be reviewed during this procedure.

Prelude

GKE is already configured, the Container Registry is already running in GKE, and the HAProxy nodes already have the GKE Service Endpoint configured. By default, the weight of that endpoint is set to 0, so Kubernetes does not receive any production traffic.

graph TB;
  a(fe-registry-0X-lb-gprd) -->|100/UP| b(registry-01-sv-gprd);
  a -->|100/UP| c(registry-02-sv-gprd);
  a -->|100/UP| d(registry-03-sv-gprd);
  a -->|100/UP| e(registry-04-sv-gprd);
  a -->|0/UP| f(gke-registry);
  linkStyle 0 stroke-width:2px,fill:none,stroke:green;
  linkStyle 1 stroke-width:2px,fill:none,stroke:green;
  linkStyle 2 stroke-width:2px,fill:none,stroke:green;
  linkStyle 3 stroke-width:2px,fill:none,stroke:green;
  linkStyle 4 stroke-width:2px,fill:none,stroke:blue;

Proposal

We'd like to ramp up slowly and monitor our infrastructure throughout this change. By the end of this week (2019-08-23) we'd like Kubernetes to be accepting 20% of all Container Registry traffic. We'll do this in controlled steps, hosting a recorded Zoom call each time we change the weight of the GKE endpoint.

Ramp Up

The table below (see Procedure) lists when each change is requested, the weight to set on the GKE Service Endpoint, and the estimated share of traffic that endpoint will receive. The estimate is derived from the current weight of each VM (100) and the fact that we have 4 VMs in total; the GKE Service Endpoint is a 5th endpoint in HAProxy. Past a certain point, we'll need to put VMs into maintenance mode in order to keep ramping up traffic into GKE. We'd like at least 4 hours between changes.
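As a rough sketch of the arithmetic behind the traffic estimates, assuming HAProxy splits requests approximately in proportion to server weight and every VM that is not in maintenance keeps its weight of 100 (the symbols $w_{\text{gke}}$ and $n_{\text{up}}$ are introduced here purely for illustration):

```latex
\[
\text{GKE traffic share} \approx
\frac{w_{\text{gke}}}{w_{\text{gke}} + 100 \cdot n_{\text{up}}},
\qquad
\text{e.g. } \frac{100}{100 + 100 \cdot 4} = \frac{100}{500} = 20\%
\]
```

The actual split also depends on the balancing algorithm and connection lifetimes, so these figures are estimates rather than guarantees.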

Because the release occurs on the 22nd, we propose a blackout period around the release announcement to ensure the stability of GitLab.com. This must be coordinated with the Delivery team and members of #release-post. We should not apply any step of this change request within 4 hours before or after the 12.2.0 announcement.

Once the GKE cluster weight is set to 100, we will start setting VMs into maintenance mode in HAProxy, since HAProxy cannot handle a weight integer larger than 255. This issue description will be updated with that plan of action after the production readiness review has been completed.

If we pass the readiness review, a Chef configuration change should place the GKE Service Endpoint into the same configuration as our servers, which sets the default weight of this endpoint to 100. Once this is done, we'll need to both set VMs into maintenance mode and adjust the weight of the GKE Service Endpoint in order to direct the desired amount of traffic.
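To illustrate why both levers are needed, here is a minimal sketch that inverts the proportional-weight estimate: given a target traffic percentage and the number of VMs still up, it returns the GKE weight required. The function name and constants are hypothetical, and the 255 ceiling is simply the limit this plan works with.

```python
VM_WEIGHT = 100       # weight of each registry VM, as stated in this issue
MAX_GKE_WEIGHT = 255  # weight ceiling this plan works with (taken from the text)


def required_gke_weight(target_percent: int, vms_up: int) -> int:
    """Smallest GKE weight that yields at least target_percent of traffic.

    Assumes traffic is split in proportion to weight across the servers
    that are not in maintenance mode.
    """
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 1 and 99")
    if vms_up < 1:
        raise ValueError("with no VMs up, any positive weight takes all traffic")
    vm_total = VM_WEIGHT * vms_up
    # share = w / (w + vm_total)  =>  w = share * vm_total / (1 - share)
    numerator = target_percent * vm_total
    denominator = 100 - target_percent
    weight = -(-numerator // denominator)  # integer ceiling division
    if weight > MAX_GKE_WEIGHT:
        raise ValueError(
            f"needs weight {weight} > {MAX_GKE_WEIGHT}; put more VMs into MAINT"
        )
    return weight
```

Under these assumptions, `required_gke_weight(56, 2)` returns 255, while asking for 50% with all four VMs up raises an error because it would need a weight of 400.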

Monitoring and Logging

Logs - https://log.gitlab.net/goto/46f7e54eedee88c00478447407bb40d4
- more precise filtering from the start: ~~https://log.gitlab.net/goto/bfd05b1b5a0793d82bd0fc52d2a1cc5a~~ ~~https://log.gitlab.net/goto/b45b1b744081832396d1323db39b5879~~ https://log.gitlab.net/goto/43fb211edc754b58d4e22f3142eaeb8f
Application Metrics: https://dashboards.gitlab.net/d/CoBSgj8iz/application-info
Pod Metrics: https://dashboards.gitlab.net/d/oWe9aYxmk/pod-metrics
Registry service error ratios - https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now&fullscreen&panelId=8
General service metrics, registry - https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now

Procedure

Date	GKE Endpoint Weight Config	Traffic %	Notes	Apply
2019-08-27 CEST	10	2	all VMs are online	`./bin/set-weights gprd gke-registry 10`
2019-08-27 CEST	50	10	all VMs are online	`./bin/set-weights gprd gke-registry 50`
2019-08-27 EDT	100	20	all VMs are online	`./bin/set-weights gprd gke-registry 100`
2019-08-27 EDT	N/A	N/A	- ~~Merge the following before continuing the steps below: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1676~~	N/A
2019-08-27 EDT	100	33	set VMs registry-03 and registry-04 into `MAINT`	`./bin/set-server-state gprd maint registry-04` `./bin/set-server-state gprd maint registry-03`
2019-08-28 EDT	255	56	increase the weight of the gke cluster	`./bin/set-weights gprd gke 255`
2019-08-27 CEST	255	72	set VM registry-02 into `MAINT`	`./bin/set-server-state gprd maint registry-02`
2019-08-28 EDT	100	100	set VM registry-01 into `MAINT`	`./bin/set-server-state gprd maint registry-01` `./bin/set-weights gprd gke 100`
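The Traffic % column can be sanity-checked with a short script. This is only a sketch: it assumes traffic is split in proportion to weight among servers that are not in `MAINT`, and the `STEPS` list and function name are illustrative, transcribed from the rows above that change weight or server state.

```python
VM_WEIGHT = 100  # per-VM weight stated in the Ramp Up section

# (GKE weight, number of VMs still UP) for each row that changes traffic
STEPS = [
    (10, 4),
    (50, 4),
    (100, 4),
    (100, 2),  # registry-03 and registry-04 in MAINT
    (255, 2),
    (255, 1),  # registry-02 in MAINT
    (100, 0),  # registry-01 in MAINT
]


def gke_share_percent(gke_weight: int, vms_up: int) -> int:
    """Estimated GKE traffic share, rounded to a whole percent."""
    total = gke_weight + VM_WEIGHT * vms_up
    return round(100 * gke_weight / total)


if __name__ == "__main__":
    for weight, vms_up in STEPS:
        share = gke_share_percent(weight, vms_up)
        print(f"weight={weight:>3} vms_up={vms_up} -> ~{share}%")
```

Under this assumption the weight-50 step comes out near 11% rather than the 10% listed in the table; the other rows match after rounding.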

Rollback

To roll back at any time, run the following commands to move traffic back to the VMs:


./bin/set-server-state gprd ready registry-04
./bin/set-server-state gprd ready registry-03
./bin/set-server-state gprd ready registry-02
./bin/set-server-state gprd ready registry-01
# Ensure that the registry VMs are online before removing traffic from GKE
./bin/set-weights gprd gke 0
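If the rollback ever needs to be scripted, the ordering above could be captured in a small wrapper along these lines. This is a hypothetical sketch, not an existing tool: it only shells out to the two `./bin` scripts already named in this issue, defaults to a dry run, and does not replace the manual check that the VMs are actually online.

```python
import subprocess
from typing import List

# VMs in the order used by the rollback commands in this issue
VMS = ["registry-04", "registry-03", "registry-02", "registry-01"]


def rollback_commands(env: str = "gprd") -> List[List[str]]:
    """Rollback commands in order: VMs back to ready, then GKE weight to 0."""
    cmds = [["./bin/set-server-state", env, "ready", vm] for vm in VMS]
    cmds.append(["./bin/set-weights", env, "gke", "0"])
    return cmds


def run_rollback(env: str = "gprd", dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in rollback_commands(env):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command, so the GKE
            # weight is not dropped if a VM could not be set to ready.
            subprocess.run(cmd, check=True)
```

Calling `run_rollback()` with the defaults only prints the five commands, which makes it safe to review the sequence before running it for real.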

Expected Outcome

After the above has completed, our configuration will look like the following:

graph TB;
  a(fe-registry-0X-lb-gprd) -->|100/MAINT| b(registry-01-sv-gprd);
  a -->|100/MAINT| c(registry-02-sv-gprd);
  a -->|100/MAINT| d(registry-03-sv-gprd);
  a -->|100/MAINT| e(registry-04-sv-gprd);
  a -->|100/UP| f(gke-registry);
  linkStyle 0 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 1 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 2 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 3 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 4 stroke-width:2px,fill:none,stroke:green;

Cleanup

Once this change is complete, the old VMs will be removed. This is tracked in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7643

Edited Aug 30, 2019 by John Jarvis