Transition Container Registry from Running on Virtual Machines to Kubernetes Pods
Production Change - Criticality 4 C4
| Swap Container Registry VMs to Kubernetes Pods | Let's slowly transition from running the Container Registry on VMs to running it on Pods in GKE instead |
|---|---|
| Change Type | Change |
| Services Impacted | Container Registry |
| Change Team Members | @jarv @skarbek |
| Change Severity | C4 |
| Buddy check or tested in staging | Any interested parties? |
| Schedule of the change | 2019-08-20 through TBD |
| Duration of the change | This is a transitional change: each step takes only seconds to apply, followed by a period of monitoring and, if needed, a rollback. The transition period has no specific end date because we have other documents we'd like reviewed during this procedure |
Prelude
GKE is already configured, the Container Registry is already running in GKE, and the haproxy nodes already have the GKE Service Endpoint ready to go. By default the weight of this endpoint is set to 0, so Kubernetes does not yet see any production traffic.
graph TB;
a(fe-registry-0X-lb-gprd) -->|100/UP| b(registry-01-sv-gprd);
a -->|100/UP| c(registry-02-sv-gprd);
a -->|100/UP| d(registry-03-sv-gprd);
a -->|100/UP| e(registry-04-sv-gprd);
a -->|0/UP| f(gke-registry);
linkStyle 0 stroke-width:2px,fill:none,stroke:green;
linkStyle 1 stroke-width:2px,fill:none,stroke:green;
linkStyle 2 stroke-width:2px,fill:none,stroke:green;
linkStyle 3 stroke-width:2px,fill:none,stroke:green;
linkStyle 4 stroke-width:2px,fill:none,stroke:blue;
Proposal
We'd like to ramp up slowly and monitor our infrastructure throughout this change. By the end of this week (2019-08-23), we'd like Kubernetes to be accepting 20% of all Container Registry traffic. We'll do this in controlled steps, hosting a recorded Zoom call each time we change the weight of the GKE endpoint.
Ramp Up
The table below outlines when each change is requested, the weight to be set on the GKE Service Endpoint, and the estimated share of traffic that endpoint will see at that weight. The estimate is derived from the current weight of each VM (100) and the fact that we have 4 VMs in total; the GKE Service Endpoint acts as a 5th endpoint in haproxy. After a certain point, we'll need to put VMs into maintenance mode in order to keep ramping up traffic into GKE. We'd like to leave at least 4 hours between each change.
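The traffic estimate can be reproduced with a few lines of Python. This is only a sketch: it assumes haproxy spreads requests in proportion to server weight and that servers in MAINT receive nothing, and the function name `gke_traffic_share` is made up for illustration.

```python
from typing import Sequence


def gke_traffic_share(gke_weight: int, vm_weights: Sequence[int]) -> float:
    """Estimated fraction of requests routed to the GKE endpoint.

    Assumption: traffic is proportional to weight, and VMs in MAINT are
    simply left out of vm_weights.
    """
    total = gke_weight + sum(vm_weights)
    if total == 0:
        raise ValueError("no backend has a non-zero weight")
    return gke_weight / total


# Four VMs online at weight 100 each, as described above.
vms = [100, 100, 100, 100]
for weight in (10, 50, 100):
    print(f"weight {weight:>3}: {gke_traffic_share(weight, vms):.1%}")
```

With all four VMs online this gives roughly 2.4%, 11.1%, and 20.0% for weights 10, 50, and 100, which the Procedure table lists as 2, 10, and 20.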
Because the release occurs on the 22nd, we propose a blackout period around the release announcement to ensure the stability of GitLab.com. This must be coordinated with the Delivery team and members of #release-post. We should pause this change request for 4 hours before and after the 12.2.0 announcement.
Once the GKE cluster weight is set to 100, sending more traffic to GKE will require setting VMs into maintenance mode in haproxy, since haproxy cannot handle a weight integer larger than 255. This issue description will be updated with that plan of action once the production readiness review has been completed.
If we pass the readiness review, a Chef configuration change should place the GKE Service Endpoint into the same configuration as our servers. This will set the default weight of this endpoint to 100. Once this is done, we'll need to both set VMs into maintenance mode and adjust the weight of the GKE Service Endpoint in order to direct the desired amount of traffic.
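To see why maintenance mode becomes necessary, the sketch below computes the largest share GKE can reach for a given number of online VMs. It takes the 255 weight ceiling and the VM weight of 100 from this plan and again assumes traffic is proportional to weight; `max_gke_share` and `vms_to_drain` are hypothetical helper names, not part of any existing tooling.

```python
MAX_WEIGHT = 255  # weight ceiling used in this plan
VM_WEIGHT = 100   # weight of each registry VM, per this plan


def max_gke_share(vms_online: int) -> float:
    """Largest share GKE can take while `vms_online` VMs keep serving."""
    return MAX_WEIGHT / (MAX_WEIGHT + vms_online * VM_WEIGHT)


def vms_to_drain(target_share: float, total_vms: int = 4) -> int:
    """Fewest VMs to put into MAINT so that target_share is reachable."""
    for drained in range(total_vms + 1):
        if max_gke_share(total_vms - drained) >= target_share:
            return drained
    raise ValueError("target share must be at most 1.0")


for target in (0.20, 0.50, 0.70, 1.00):
    print(f"target {target:.0%}: drain {vms_to_drain(target)} VM(s)")
```

Under these assumptions GKE tops out at about 39% (255/655) with all four VMs online, so any target above that needs at least one VM in MAINT, and 100% needs all four drained.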
Monitoring and Logging
- Logs: https://log.gitlab.net/goto/46f7e54eedee88c00478447407bb40d4
- Application Metrics: https://dashboards.gitlab.net/d/CoBSgj8iz/application-info
- Pod Metrics: https://dashboards.gitlab.net/d/oWe9aYxmk/pod-metrics
- Registry service error ratios: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now&fullscreen&panelId=8
- General service metrics, registry: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=registry&from=now-1h&to=now
Procedure
| Date | GKE Endpoint Weight Config | Traffic % | Notes | Apply |
|---|---|---|---|---|
| | 10 | 2 | all VM's are online | ./bin/set-weights gprd gke-registry 10 |
| | 50 | 10 | all VM's are online | ./bin/set-weights gprd gke-registry 50 |
| | 100 | 20 | all VM's are online | ./bin/set-weights gprd gke-registry 100 |
| | N/A | N/A | - | N/A |
| | 100 | 33 | set VMs registry-03 and registry-04 into MAINT | ./bin/set-server-state gprd maint registry-04<br>./bin/set-server-state gprd maint registry-03 |
| | 255 | 56 | increase the weight of the gke cluster | ./bin/set-weights gprd gke 255 |
| | 255 | 72 | set VM registry-02 into MAINT | ./bin/set-server-state gprd maint registry-02 |
| | 100 | 100 | set VM registry-01 into MAINT | ./bin/set-server-state gprd maint registry-01<br>./bin/set-weights gprd gke 100 |
Rollback
At any time, the following commands can be run to roll back and move traffic to the VMs:
./bin/set-server-state gprd ready registry-04
./bin/set-server-state gprd ready registry-03
./bin/set-server-state gprd ready registry-02
./bin/set-server-state gprd ready registry-01
# Ensure that the registry VMs are online
./bin/set-weights gprd gke 0
Expected Outcome
Once the procedure above has completed, our configuration will look like the following:
graph TB;
a(fe-registry-0X-lb-gprd) -->|100/MAINT| b(registry-01-sv-gprd);
a -->|100/MAINT| c(registry-02-sv-gprd);
a -->|100/MAINT| d(registry-03-sv-gprd);
a -->|100/MAINT| e(registry-04-sv-gprd);
a -->|100/UP| f(gke-registry);
linkStyle 0 stroke-width:2px,fill:none,stroke:brown;
linkStyle 1 stroke-width:2px,fill:none,stroke:brown;
linkStyle 2 stroke-width:2px,fill:none,stroke:brown;
linkStyle 3 stroke-width:2px,fill:none,stroke:brown;
linkStyle 4 stroke-width:2px,fill:none,stroke:green;
Cleanup
Once this change is complete, the old VMs will be removed. This is tracked in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7643