# [GSTG/GPRD] CI Runners Gameday

Production Change - Criticality 2 (~C2)

## Change Summary
This production issue will be used for Gamedays to simulate recovery from a zonal outage. The test simulates an outage without shutting down any CI servers, to avoid service disruption. Instead, we'll increase capacity in one of the shard colours and observe how the system distributes the load. The goal is to confirm that if one zone goes down, the system can rebalance the load and continue running smoothly once the extra capacity is in place.
The following steps will be followed when restoring CI Runners capacity in a single zone.
## Gameday execution roles and details
| Role | Assignee |
|---|---|
| Change Technician | @swainaina |
| Change Reviewer | @astarovoytov |
- Services Impacted - ~"Service::CI Runners"
- Time tracking - 90 minutes
- Downtime Component - 0 minutes
_Provide a brief summary indicating the affected zone._
## [For Gamedays only] Preparation Tasks

### One week before the gameday
- [ ] Add an event to the GitLab Production calendar.
- [ ] Make an announcement on the #f_gamedays Slack channel with this template: Next week on [DATE & TIME] we will be executing a CI Runners recovery game day. The process will involve emulating a single zone outage for the runners in `gstg` to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/xxxxx>
  Then cross-post the message to the following channels:
  - #g_production_engineering
  - #test-platform
  - #staging (if applicable)
  - #production (if applicable)
- [ ] Mention @release-managers in the Slack announcement and await their approval.
- [ ] Request approval from the Infrastructure manager, and confirm the approval is recorded with the manager_approved label.
### Just before the gameday begins
- [ ] Before commencing the change, notify the EOC and release managers on Slack with the following template and wait for their acknowledgement and approval: "@release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be emulating a single zone outage for CI Runners in `gstg` to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets. Kindly review and approve the CR."
## Detailed steps for the change

### Change Steps - steps to take to execute the change
#### Execution

- [ ] If you are conducting a practice (Gameday) run of this, consider starting a recording of the process now.
- [ ] Note the start time in UTC in a comment to record this process duration.
- [ ] Set the ~"change::in-progress" label: `/label ~change::in-progress`
- [ ] Identify the currently active colour and the current version. In this test we will target the `saas-linux-2xlarge-amd64` shard, which our earlier analyses show is less busy than the others:
  - [ ] Open the CI Runners Versions view and locate the `instance` column to identify the active deployment colour and version for the shard. Record the information in a comment on this issue.
  - [ ] Record the number of active Runner Managers in the specific shard (see the optional cross-check sketch after this list).
- [ ] Create an MR to update the version (to the version gathered in the previous step) on the inactive colour in the required shard (for example, https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5027/diffs). Merge the MR and wait for CI to upload the changes to the Chef Server.
- [ ] After the above MR is merged and its pipelines have completed, deploy the inactive colour for the shard with the ChatOps command `/runner run start {shard_name} {inactive_colour}` in the #production Slack channel. Replace `{shard_name}` with the required shard and `{inactive_colour}` with the inactive colour (`blue` or `green`); the deployment job will then start. For example: `/runner run start saas-linux-2xlarge-amd64 green`
- [ ] Verify the deployment on the CI Runners dashboard for the specific shard (choose the required shard in the dashboard).
- [ ] Confirm that the number of active Runner Managers in the specific shard has increased.
- [ ] Note the conclusion time in UTC in a comment to record this process duration.
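In addition to the Grafana views, the Runner Manager count can be cross-checked from the command line. The sketch below is a hypothetical example only: it assumes an admin-scoped API token for the `gstg` instance (here `GITLAB_ADMIN_TOKEN`) and that Runner Managers for the shard carry the shard name as a runner tag; the CI Runners dashboards remain the canonical source.

```shell
# Hypothetical cross-check of the active Runner Manager count for the shard.
# Assumptions: admin-scoped token in $GITLAB_ADMIN_TOKEN, shard name used as a runner tag.
curl --silent --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://staging.gitlab.com/api/v4/runners/all?status=online&tag_list=saas-linux-2xlarge-amd64&per_page=100" \
  | jq 'length'   # compare this count before and after the deployment
```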
#### Validation
Once the additional capacity is online, let's identify the impact and look for problems.
- [ ] CI Runners Deployment overview
- [ ] CI Runner Versions view
- [ ] Confirm the job triggered by the ChatOps command completed successfully.
- [ ] Confirm at least some jobs have been provisioned on the new servers 👉 CI Runner Jobs (see the spot-check sketch after this list).
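For a quick spot-check outside the dashboard, something along these lines can list jobs recently picked up by one of the newly started Runner Managers. The runner ID is a placeholder and, as above, an admin-scoped token is assumed.

```shell
# Hypothetical spot-check: list running jobs picked up by a newly started Runner Manager.
# <runner_id> is a placeholder taken from the runners list; requires an admin-scoped token.
curl --silent --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://staging.gitlab.com/api/v4/runners/<runner_id>/jobs?status=running&per_page=5" \
  | jq '.[].id'
```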
#### Wrapping up and cleanup

- [ ] Revert the MR that updated the version on the inactive colour in the Execution steps above.
- [ ] After the revert MR is merged and its pipelines have completed, drain the previously changed colour for the shard using the ChatOps command `/runner run stop {shard_name} {same_colour}` in the #production Slack channel, e.g. `/runner run stop saas-linux-2xlarge-amd64 green`.
- [ ] Verify the result on the CI Runners dashboard for the specific shard (choose the required shard in the dashboard).
- [ ] Confirm that the number of active Runner Managers in the specific shard has returned to the pre-change value.
- [ ] Set the ~"change::complete" label: `/label ~change::complete`
- [ ] Notify @release-managers and @sre-oncall that the exercise is complete.
- [ ] Compile the real-time measurement of this process and update the Recovery Measurements Runbook (a minimal duration sketch follows this list).
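For the recovery measurement, the elapsed time is simply the difference between the start and conclusion times noted in the issue comments. A minimal sketch, with placeholder timestamps and GNU `date` assumed:

```shell
# Minimal sketch: compute the gameday duration from the UTC timestamps noted in
# the issue comments (the values below are placeholders).
start="2025-01-01T10:00:00Z"
end="2025-01-01T10:45:00Z"
echo "$(( ( $(date -ud "$end" +%s) - $(date -ud "$start" +%s) ) / 60 )) minutes"
```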
### Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change.

It is estimated that this will take about 5 minutes to complete.

- [ ] Revert the MR that updated the version on the inactive colour in the Execution steps above.
- [ ] After the revert MR is merged and its pipelines have completed, drain the previously changed colour for the shard using the ChatOps command `/runner run stop {shard_name} {same_colour}` in the #production Slack channel, e.g. `/runner run stop saas-linux-2xlarge-amd64 green`.
- [ ] Verify the result on the CI Runners dashboard for the specific shard (choose the required shard in the dashboard).
- [ ] Confirm that the number of active Runner Managers in the specific shard has returned to the pre-change value.
- [ ] Set the ~"change::aborted" label: `/label ~change::aborted`
- [ ] Notify @release-managers and @sre-oncall that the exercise has been aborted.
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed upon with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results are noted in a comment on this issue.
  - A dry-run has been conducted and results are noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed before the change is rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgement.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.