# [gstg] HAProxy zonal outage gameday

Production Change - Criticality 2 (C2)
## Change Summary

This change issue is used both for gamedays and for recovery during a real zonal outage. It outlines the steps to follow when testing traffic shifts away from an impacted zone. Corrective actions identified during testing should help us refine the steps we would take during a real outage.
## Gameday execution roles and details
| Role | Assignee |
|---|---|
| Change Technician | @ayeung |
| Change Reviewer | @astarovoytov |
- Services Impacted - TBD
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
This gameday will simulate an outage in `us-east1-c`.
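Before the change window, it can help to confirm exactly which HAProxy nodes live in the zone being drained. Below is a minimal inventory sketch reusing the `knife search` query from the execution steps; the `gcloud` cross-check is an assumption about your local project configuration and is not part of the official runbook.

```shell
# List the gstg HAProxy Chef nodes in the zone we are about to drain.
export ZONE=us-east1-c
knife search node "name:haproxy* AND chef_environment:gstg AND zone:*${ZONE}" -i

# Hypothetical cross-check against GCP itself (add --project for the gstg project):
gcloud compute instances list --filter="name~^haproxy AND zone:${ZONE}"
```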
## [For Gamedays only] Preparation Tasks

### One week before the gameday
- [ ] Add an event to the GitLab Production calendar.
- [ ] Make an announcement on Slack following this template:

  > Next week on [DATE & TIME] we will be executing a Traffic Routing game day. The process will involve moving traffic away from a single zone in `gstg` to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17274>.

  Post the message in the following channels:

  - [ ] #g_production_engineering
  - [ ] #test-platform
  - [ ] #staging (if applicable)
  - [ ] #production (if applicable)
- [ ] Notify the release managers on Slack by mentioning @release-managers, referencing this issue, and awaiting their acknowledgment.
### Just before the gameday begins

- [ ] Before commencing the change, notify the EOC and release managers on Slack following this template:

  > @release-managers or @sre-oncall [LINK_TO_THIS_CR] is scheduled for execution. We will be diverting traffic away from a single zone ([NAME_OF_ZONE]) in `gstg` to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets. Kindly review and approve the CR.
## Detailed steps for the change

### Change Steps - steps to take to execute the change

#### Execution
- [ ] If you are conducting a practice (gameday) run of this, consider starting a recording of the process now.
- [ ] Note the start time in UTC in a comment to record the duration of this process.
- [ ] Set the change label to in progress: `/label ~change::in-progress`
- [ ] Clone the chef-repo repository if you haven't already:

  ```shell
  git clone git@gitlab.com:gitlab-com/gl-infra/chef-repo.git
  ```

- [ ] In your terminal, set the `ZONE` environment variable to the zone you will be diverting traffic away from:

  ```shell
  export ZONE=us-east1-c
  ```

- [ ] Reconfigure the regional cluster to exclude the affected zone by setting `regional_cluster_zones` in Terraform to the list of zones that are not impacted:
  - [ ] Create the MR to update `regional_cluster_zones`. While emulating a zonal outage, make sure to create replacement nodes. Refer to this example MR. You can list the HAProxy nodes in the affected zone with:

    ```shell
    # Command to check HAProxy nodes in a particular zone
    knife search node "name:haproxy* AND chef_environment:gstg AND zone:*${ZONE}"
    ```

  - [ ] Get the MR approved.
  - [ ] Merge the MR.
- [ ] Reconfigure the HAProxy node pools to exclude the nodes in the affected zone and include the new nodes you created in the previous step:
  - [ ] Create the MR to replace the nodes in the affected zone with the new replacement nodes. Refer to this example MR.
  - [ ] Get the MR approved.
  - [ ] ⚠ Make sure the MR to update `regional_cluster_zones` in the previous step has been merged, and that the new nodes have been provisioned and completed bootstrapping, before merging this MR! This can take up to half an hour. You can quickly check by attempting to SSH into the new nodes. ⚠
  - [ ] Trigger a chef-client run on all HAProxy nodes once the chef-repo pipeline has completed:

    ```shell
    knife ssh 'name:haproxy* AND chef_environment:gstg' 'sudo chef-client'
    ```
- [ ] Remove the HAProxy instances from the GCP load balancers (this must be done AFTER the above Terraform change is applied):

  ```shell
  cd chef-repo
  ./bin/manage-gcp-lb-haproxy
  ```

  The script will prompt for an environment, then a zone. Select the values corresponding to the zone we are removing traffic from.
- [ ] Disable the HAProxy servers:

  ```shell
  cd chef-repo
  ./bin/disable-server gstg ${ZONE}
  ```

  - [ ] Validate that all servers in the affected zone have their state set to MAINT. You can use this query to confirm that there are zero backends in the `UP` state for the zone that you are removing traffic from (see also the verification sketch after this list).
- [ ] Note the conclusion time in UTC in a comment to record the duration of this process.
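The dashboard query linked above is the canonical check; as a secondary spot-check, the sketch below reads backend state directly from the HAProxy admin socket on each node in the drained zone. The socket path and the availability of `socat` on the nodes are assumptions about the node image, so adjust as needed; after a successful drain the command should print nothing.

```shell
# Print any backend on the drained zone's HAProxy nodes still reporting UP
# (CSV field 18 of `show stat` is the status column; socket path is an assumption).
knife ssh "name:haproxy* AND chef_environment:gstg AND zone:*${ZONE}" \
  "echo 'show stat' | sudo socat stdio /run/haproxy/admin.sock | awk -F, '\$18 ~ /UP/ {print \$1, \$2}'"
```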
#### Validation

Once traffic is restricted to the remaining two zones, identify the impact and look for problems. A quick external smoke-test sketch follows the list.
- [ ] Do we see a drop in CPU usage in the affected zone's cluster? GSTG Per Cluster CPU Usage
- [ ] Do we see a drop in HPA targets in the affected zone's cluster? GSTG Per Cluster HPA Target
- [ ] Examine the GSTG Rails logs for errors.
- [ ] Examine the frontend dashboard for GSTG.
- [ ] Examine whether the connected peers have changed with the newly introduced nodes.
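Alongside the dashboards, a quick black-box probe can confirm that `gstg` is still serving user traffic from the remaining zones. This is a hypothetical smoke test, not an official validation step; it assumes staging.gitlab.com is reachable from your workstation.

```shell
# Fire a few requests at the gstg sign-in page and report status code and latency.
for i in 1 2 3 4 5; do
  curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' https://staging.gitlab.com/users/sign_in
done
```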
#### Wrapping up and cleanup

- [ ] Re-enable the zonal GKE backend cluster in HAProxy:

  ```shell
  cd chef-repo
  ./bin/enable-server gstg ${ZONE}
  ```

  - [ ] Validate that all servers in the affected zone have their state set to UP. You can use this query to confirm that there are zero backends in the `MAINT` state in the zone that you are returning to service.
- [ ] Open an MR to revert the change in chef-repo (Example MR).
  - [ ] Get the MR approved.
  - [ ] Merge the MR.
  - [ ] Trigger a chef-client run on all HAProxy nodes once the chef-repo pipeline has completed:

    ```shell
    knife ssh 'name:haproxy* AND chef_environment:gstg' 'sudo chef-client'
    ```
- [ ] Open an MR to revert the change that disabled the zone in the regional cluster. (This will also revert the changes made to the GCP load balancers with `gcloud` commands.)
  - [ ] Get the MR approved.
  - [ ] Make sure the MR in chef-repo to remove the gameday HAProxy nodes from peering has been merged and chef-client has run on the nodes.
  - [ ] Merge the MR.
- [ ] Set the change label to complete: `/label ~change::complete`
- [ ] Notify @release-managers and @sre-oncall that the exercise is complete.
- [ ] Compile the real-time measurements of this process and update the Recovery Measurements Runbook (a timing sketch follows this list).
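For the Recovery Measurements update, a small helper like the one below can turn the start and conclusion comments into an elapsed-time figure. It assumes GNU `date` (on macOS, use `gdate` from coreutils); the timestamps shown are placeholders.

```shell
# Compute elapsed minutes between the noted start and conclusion times (UTC).
START="2025-01-15T14:00:00Z"   # placeholder: the start time you commented earlier
END="2025-01-15T15:10:00Z"     # placeholder: the conclusion time
echo "Elapsed: $(( ($(date -d "$END" +%s) - $(date -d "$START" +%s)) / 60 )) minutes"
```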
### Rollback

#### Rollback steps - steps to take if this change needs to be rolled back

Rollback is estimated to take 5 minutes.

- [ ] Re-enable HAProxy:

  ```shell
  cd chef-repo
  ./bin/enable-server gstg ${ZONE}
  ```

- [ ] Set the change label to aborted: `/label ~change::aborted`
- [ ] Notify @release-managers and @sre-oncall that the exercise has been aborted. A post-rollback verification sketch follows this list.
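As with the drain, you can double-check that the rollback took effect beyond the linked query; the sketch below reuses the admin-socket assumptions from the execution section and should print nothing once every backend in the zone is back in service.

```shell
# Print any backend in the restored zone still reporting MAINT
# (socket path and socat availability are assumptions).
knife ssh "name:haproxy* AND chef_environment:gstg AND zone:*${ZONE}" \
  "echo 'show stat' | sudo socat stdio /run/haproxy/admin.sock | awk -F, '\$18 ~ /MAINT/ {print \$1, \$2}'"
```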
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed upon with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results are noted in a comment on this issue.
  - A dry-run has been conducted and results are noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed before the change is rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or `blocks deployments` change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgement.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.