Move production registry traffic to the zonal clusters
Production Change
Change Summary
This change will move registry traffic from the regional cluster to the zonal clusters. We are doing this to better divide our traffic into zones, which will make it easier for us to identify zone-specific issues.
- We have ~50 pods running in production servicing registry traffic
- The zonal clusters have 30 pods running in each zone, which is the minReplicas
- Given that we will be taking approximately 1/3rd of the traffic into the zonal clusters, this should be more than enough pods to match what is currently running in the regional cluster.
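The capacity reasoning above can be checked with a few lines of arithmetic. This is only a rough sketch: the pod counts come from the summary, while the zone count of three and the assumption that load scales linearly with pod count are not stated in this issue and are labeled as assumptions in the code.

```python
import math

REGIONAL_PODS = 50       # "~50 pods" serving registry traffic today (from the summary)
ZONAL_MIN_REPLICAS = 30  # minReplicas in each zonal cluster (from the summary)
ZONE_COUNT = 3           # ASSUMPTION: three zonal clusters, not stated in this issue


def pods_needed(traffic_share: float, baseline_pods: int = REGIONAL_PODS) -> int:
    """Pods needed for a share of traffic, assuming load scales linearly with pods."""
    return math.ceil(baseline_pods * traffic_share)


per_zone_needed = pods_needed(1 / ZONE_COUNT)        # one third of today's load
total_zonal_pods = ZONAL_MIN_REPLICAS * ZONE_COUNT   # floor across all zones

print(per_zone_needed, total_zonal_pods)  # prints "17 90"
```

Under these assumptions a zone taking one third of the traffic needs about 17 pods against a floor of 30, and the combined floor of 90 pods exceeds the ~50 pods in the regional cluster.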
This change will be split into two parts:
- Part1: Split traffic between the regional and zonal clusters
- Part2: Move all non-canary traffic to the zonal clusters
Change Details
- Services Impacted - registry
- Change Technician - @jarv
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer - @skarbek
- Due Date - 2020-11-30
- Time tracking - 60min
- Downtime Component - none
Detailed steps for the change
Split traffic between the regional and zonal clusters
- Stop chef on HAProxy registry VMs
  knife ssh 'roles:gprd-base-lb-registry-config' 'sudo chef-client-disable'
- Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4642
- Run chef on fe-registry-01-lb-gprd.c.gitlab-production.internal to validate the chef run, then check backends for health
  sudo chef-client
  # SSH forward to check backend status
- Start chef on all HAProxy VMs
Move all traffic to zonal clusters
- Stop chef on HAProxy registry VMs
  knife ssh 'roles:gprd-base-lb-registry-config' 'sudo chef-client-disable'
- Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4643
- Run chef on fe-registry-01-lb-gprd.c.gitlab-production.internal to validate the chef run, then check backends for health
  sudo chef-client
  # SSH forward to check backend status
- Start chef on all HAProxy VMs
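The "check backends for health" step in both parts is done by eye on the HAProxy status page. As an optional aid, the sketch below scans HAProxy's CSV statistics (the format returned by `show stat` on the stats socket, or by the stats page with `;csv` appended) and lists servers that are not UP. The sample data, the proxy name `registry`, and the server names are hypothetical and trimmed to three columns; real output has many more columns and the real backend names come from the merge requests above.

```python
import csv
import io


def unhealthy_servers(stats_csv: str, proxy_prefix: str = "") -> list[tuple[str, str, str]]:
    """Return (pxname, svname, status) for server rows whose status is not UP."""
    text = stats_csv.lstrip()
    # HAProxy prefixes the CSV header line with "# "; drop it so the
    # column names parse cleanly.
    if text.startswith("# "):
        text = text[2:]
    bad = []
    for row in csv.DictReader(io.StringIO(text)):
        # FRONTEND and BACKEND rows are aggregates, not individual servers.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        if not row["pxname"].startswith(proxy_prefix):
            continue
        # Anything not starting with "UP" is reported, including "DOWN",
        # "MAINT" and "no check", so it can be reviewed by hand.
        if not row["status"].startswith("UP"):
            bad.append((row["pxname"], row["svname"], row["status"]))
    return bad


# HYPOTHETICAL sample, trimmed to the columns used above.
SAMPLE = """\
# pxname,svname,status
registry,FRONTEND,OPEN
registry,gke-zonal-b,UP
registry,gke-zonal-c,DOWN
registry,BACKEND,UP
"""

print(unhealthy_servers(SAMPLE))  # prints [('registry', 'gke-zonal-c', 'DOWN')]
```

An empty list after the chef run on fe-registry-01-lb-gprd suggests the new backends are passing health checks; any entry is a reason to pause before starting chef on the remaining HAProxy VMs.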
Rollback
- Part1: Revert and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4642
- Part2: Revert and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4643
Monitoring
Key metrics to observe
- Registry service overview https://dashboards.gitlab.net/d/registry-main/registry-overview?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- Registry pod info (by cluster) https://dashboards.gitlab.net/d/registry-pod/registry-pod-info?orgId=1
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.
Edited by John Jarvis