Transition Container Registry from Running on Virtual Machines to Kubernetes Pods

Production Change - Criticality 4 C4

Swap Container Registry VMs to Kubernetes Pods: Let's slowly transition from running the Container Registry on VMs to using Pods in GKE instead.
Change Type: Change
Services Impacted: Container Registry
Change Team Members: @jarv @skarbek
Change Severity: C4
Buddy check or tested in staging: Any interested parties?
Schedule of the change: 2019-08-20 through TBD
Duration of the change: This is a transitional change. Each step takes only seconds to apply, followed by a period of monitoring and, if needed, a rollback. The transition period has no specific end date, as there are other documents we'd like reviewed during this procedure.

Prelude

GKE is already configured, the Container Registry is already running in GKE, and the HAProxy nodes already have the GKE Service Endpoint ready to go. By default, its weight is set to 0. With this value, the Kubernetes endpoint receives no production traffic.

graph TB;
  a(fe-registry-0X-lb-gprd) -->|100/UP| b(registry-01-sv-gprd);
  a -->|100/UP| c(registry-02-sv-gprd);
  a -->|100/UP| d(registry-03-sv-gprd);
  a -->|100/UP| e(registry-04-sv-gprd);
  a -->|0/UP| f(gke-registry);
  linkStyle 0 stroke-width:2px,fill:none,stroke:green;
  linkStyle 1 stroke-width:2px,fill:none,stroke:green;
  linkStyle 2 stroke-width:2px,fill:none,stroke:green;
  linkStyle 3 stroke-width:2px,fill:none,stroke:green;
  linkStyle 4 stroke-width:2px,fill:none,stroke:blue;

Proposal

We'd like to slowly ramp up and monitor our infrastructure through this change. By the end of this week (2019-08-23) we'd like Kubernetes to be accepting 20% of all Container Registry traffic. We'll do this in controlled steps, hosting a recorded Zoom call each time we change the weight of the GKE endpoint.

Ramp Up

The table below lists when each change is requested, the weight to set on the GKE Service Endpoint, and the estimated share of traffic that endpoint will then receive. The estimate is derived from the current weight of each VM (100) and the fact that we have 4 VMs in total; the GKE Service Endpoint is a 5th endpoint in HAProxy. After a certain point, we'll need to put VMs into maintenance mode in order to keep ramping up traffic into GKE. We'd like at least 4 hours between each change.
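
The estimate described above can be sketched in a few lines of Python. This is an illustration only: it assumes HAProxy distributes requests in proportion to server weights and that every backend not in maintenance is healthy. The function name `gke_traffic_share` and its parameters are hypothetical and not part of any existing tooling.

```python
def gke_traffic_share(gke_weight: int, vms_online: int = 4, vm_weight: int = 100) -> float:
    """Estimate the fraction of registry traffic sent to the GKE endpoint.

    Assumes traffic is split in proportion to backend weights, and that
    VMs in maintenance mode receive no traffic at all.
    """
    total_weight = gke_weight + vms_online * vm_weight
    if total_weight == 0:
        raise ValueError("no backend has a non-zero weight")
    return gke_weight / total_weight


if __name__ == "__main__":
    # Weights from the first ramp-up steps, with all four VMs online.
    for weight in (10, 50, 100):
        share = gke_traffic_share(weight)
        print(f"GKE weight {weight:>3}: {share:.1%} of traffic")
```

With all four VMs online, weights of 10, 50 and 100 come out at roughly 2.4%, 11.1% and 20%, close to the approximate figures (2, 10, 20) used in the Procedure table.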

Because the release occurs on the 22nd, we propose a blackout period around the release announcement to ensure the stability of GitLab.com. This must be coordinated with the Delivery team and members of #release-post. We should not make changes under this request within 4 hours before or after the 12.2.0 announcement.

Once the GKE cluster weight is set to 100, further increases will require setting VMs into maintenance mode in HAProxy, since HAProxy cannot handle a weight larger than 255. This issue description will be updated with that plan of action after the production readiness review is complete.

If we pass the readiness review, a Chef configuration change should be made that places the GKE Service Endpoint into the same configuration as our servers. This will set the default weight of this endpoint to 100. Once this is done, we'll need to both set VMs into maintenance mode and adjust the weight of the GKE Service Endpoint in order to direct the desired amount of traffic.
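
To illustrate why both levers are needed, here is a rough Python sketch of the weight required for a target traffic share and of the ceiling imposed by the weight cap. It assumes proportional weighting, a VM weight of 100, and the cap of 255 stated in this issue. The names `MAX_WEIGHT`, `max_gke_share` and `required_gke_weight` are hypothetical.

```python
from typing import Optional

MAX_WEIGHT = 255  # cap as stated in this issue; treated here as an assumption


def max_gke_share(vms_online: int, vm_weight: int = 100, max_weight: int = MAX_WEIGHT) -> float:
    """Largest traffic fraction GKE can receive with the given number of VMs online."""
    return max_weight / (max_weight + vms_online * vm_weight)


def required_gke_weight(target_percent: int, vms_online: int,
                        vm_weight: int = 100, max_weight: int = MAX_WEIGHT) -> Optional[int]:
    """Smallest integer weight that gives GKE at least target_percent of traffic.

    Returns None when the needed weight exceeds the cap, meaning more VMs
    must go into maintenance mode first. With no VMs online, any positive
    weight already yields 100%, so targets must be below 100 here.
    """
    if not 0 <= target_percent < 100:
        raise ValueError("target_percent must be in the range 0..99")
    vm_total = vms_online * vm_weight
    # w / (w + vm_total) >= p / 100  <=>  w >= p * vm_total / (100 - p)
    needed = -(-target_percent * vm_total // (100 - target_percent))  # ceiling division
    return needed if needed <= max_weight else None


if __name__ == "__main__":
    for vms in (4, 2, 1, 0):
        print(f"{vms} VMs online: GKE can take at most {max_gke_share(vms):.1%}")
    print("weight for 50% with 4 VMs online:", required_gke_weight(50, 4))
    print("weight for 50% with 2 VMs online:", required_gke_weight(50, 2))
```

Under these assumptions, with all four VMs online the cap limits GKE to about 39% of traffic, so going beyond that requires taking VMs out of rotation before raising the weight further.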

Monitoring and Logging

  • Logs - https://log.gitlab.net/goto/46f7e54eedee88c00478447407bb40d4
    • more precise filtering from the start: https://log.gitlab.net/goto/bfd05b1b5a0793d82bd0fc52d2a1cc5a https://log.gitlab.net/goto/b45b1b744081832396d1323db39b5879 https://log.gitlab.net/goto/43fb211edc754b58d4e22f3142eaeb8f
  • Application Metrics: https://dashboards.gitlab.net/d/CoBSgj8iz/application-info
  • Pod Metrics: https://dashboards.gitlab.net/d/oWe9aYxmk/pod-metrics
  • Registry service error ratios - https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now&fullscreen&panelId=8
  • General service metrics, registry - https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now

Procedure

| Date | GKE Endpoint Weight Config | Traffic % | Notes | Apply |
|---|---|---|---|---|
| 2019-08-27 CEST | 10 | 2 | all VMs are online | ./bin/set-weights gprd gke-registry 10 |
| 2019-08-27 CEST | 50 | 10 | all VMs are online | ./bin/set-weights gprd gke-registry 50 |
| 2019-08-27 EDT | 100 | 20 | all VMs are online | ./bin/set-weights gprd gke-registry 100 |
| 2019-08-27 EDT | N/A | N/A | Merge the following before continuing the steps below: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1676 | N/A |
| 2019-08-27 EDT | 100 | 33 | set VMs registry-03 and registry-04 into MAINT | ./bin/set-server-state gprd maint registry-04; ./bin/set-server-state gprd maint registry-03 |
| 2019-08-28 EDT | 255 | 56 | increase the weight of the GKE cluster | ./bin/set-weights gprd gke 255 |
| 2019-08-27 CEST | 255 | 72 | set VM registry-02 into MAINT | ./bin/set-server-state gprd maint registry-02 |
| 2019-08-28 EDT | 100 | 100 | set VM registry-01 into MAINT | ./bin/set-server-state gprd maint registry-01; ./bin/set-weights gprd gke 100 |

Rollback

To roll back at any time, run the following commands to move traffic back to the VMs:


./bin/set-server-state gprd ready registry-04
./bin/set-server-state gprd ready registry-03
./bin/set-server-state gprd ready registry-02
./bin/set-server-state gprd ready registry-01
# Ensure that the registry VMs are online
./bin/set-weights gprd gke 0

Expected Outcome

After the above has completed, our configuration will look like the following:

graph TB;
  a(fe-registry-0X-lb-gprd) -->|100/MAINT| b(registry-01-sv-gprd);
  a -->|100/MAINT| c(registry-02-sv-gprd);
  a -->|100/MAINT| d(registry-03-sv-gprd);
  a -->|100/MAINT| e(registry-04-sv-gprd);
  a -->|100/UP| f(gke-registry);
  linkStyle 0 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 1 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 2 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 3 stroke-width:2px,fill:none,stroke:brown;
  linkStyle 4 stroke-width:2px,fill:none,stroke:green;

Cleanup

Once this change is complete, the old VMs will be removed. This is tracked in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7643

Edited Aug 30, 2019 by John Jarvis