Make temp haproxy load balancers for registry more robust for the short term
C1
Production Change - Criticality 1

Change Component | Description |
---|---|
Change Objective | Add more nodes to our temp haproxy registry fleet for AZ availability |
Change Type | Operation |
Services Impacted | Registry |
Change Team Members | @alejandro @cmcfarland |
Change Criticality | C1 |
Change Reviewer | A colleague who will review the change |
Tested in staging | N/A |
Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result |
Due Date | 20:15 2020-07-08 UTC |
Time tracking | 60 minutes (including a possible rollback) |
Downtime Component | 0 |
Detailed steps for the change
- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1913
- Targeted apply to create two new fe-haproxy-[03,04]-temp nodes: `tf plan -target "module.fe-lb-registry-temp.google_compute_instance.default[2]" -target "module.fe-lb-registry-temp.google_compute_instance.default[3]" -out create-temp.out`
- Add the new nodes to the registry Google load balancers (http and https).
- Make sure that the new nodes are taking traffic properly.
- Targeted apply to update/replace the existing fe-haproxy-[01,02]-temp nodes to the right availability zones.
- Add the replaced nodes to the registry Google load balancers (http and https).
- Make sure that the replaced nodes are taking traffic properly.
- Targeted apply to verify that the Google load balancers are updated properly. If there are changes noted, import the right state or update the Terraform code to match.
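The two targeted applies above differ only in which instance indices they target, so the plan commands can be generated rather than typed by hand. The following is a minimal sketch, not the team's actual tooling: the helper name `build_plan_cmd` and the plan file name `replace-temp.out` are hypothetical, and the mapping of fe-haproxy-[01,02]-temp to indices 0 and 1 is an assumption inferred from nodes 03 and 04 using indices 2 and 3. The script only prints the commands for review and does not run `tf`.

```shell
#!/usr/bin/env bash
# Sketch: print targeted plan commands for review (dry run only).
set -euo pipefail

# Resource address taken from the plan command in the steps above.
MODULE="module.fe-lb-registry-temp.google_compute_instance.default"

# build_plan_cmd OUT_FILE INDEX...
# Hypothetical helper: prints a targeted `tf plan` command for the
# given instance indices, saving the plan to OUT_FILE.
build_plan_cmd() {
  local out_file="$1"; shift
  local cmd="tf plan"
  local idx
  for idx in "$@"; do
    cmd+=" -target \"${MODULE}[${idx}]\""
  done
  printf '%s -out %s\n' "$cmd" "$out_file"
}

# New nodes fe-haproxy-[03,04]-temp: indices 2 and 3, as in the steps above.
build_plan_cmd create-temp.out 2 3

# Existing nodes fe-haproxy-[01,02]-temp: indices 0 and 1 are an ASSUMPTION,
# and replace-temp.out is a hypothetical file name.
build_plan_cmd replace-temp.out 0 1
```

Printing the commands first gives the reviewer something concrete to check before anything touches production. Writing each plan to a file with `-out` means the later apply can use exactly the reviewed plan instead of recomputing it.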
Rollback steps
Monitoring
Key metrics to observe
- Metric: Registry Service Metrics
- Location: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&refresh=30s&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=registry&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: RPS dropping below nominal levels or error rates rising. We want to see nominal RPS and low to no error rates.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no active incidents
Edited by Alejandro Rodríguez