[GSTG] Gameday - Patroni Zonal Outage
C2
Production Change - Criticality 2

Change Summary
This production issue is to be used for Gamedays as well as for recovery in the event of a zonal outage. It outlines the steps to follow when adding new Patroni replicas in a new zone.
There are several Patroni clusters. Regardless of the specific steps, be sure to include every cluster required for operation when conducting this process:
- main
- ci
- registry
- embedding
- security (in progress at this time)
Gameday execution roles and details
| Role | Assignee |
|---|---|
| Change Technician | @knottos @stejacks-gitlab |
| Change Reviewer | @swainaina |
- Services Impacted - ~"Service::PatroniV16" ~"Service::PatroniCiV16" ~"Service::PatroniRegistryV16"
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-21 13:00
- Time tracking - 90 minutes
Provide a brief summary indicating the affected zone
[For Gamedays only] Preparation Tasks a Week in Advance
- One week before the gameday, make an announcement on Slack in #f_gamedays, copy the link to the #production_engineering channel, and request approval from @release-managers. Consider also sharing the post in the appropriate environment channel (staging or production).
  - Example message:
Next week on [DATE & TIME] we will be executing a Patroni and PGBouncer zonal outage game day. The process will involve provisioning additional capacity for these services in alternate zones. This should take approximately 90 minutes, the aim for this exercise is to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See LINK_TO_CHANGE_ISSUE for details.
- Request approval from the Infrastructure Manager, wait for approval, and confirm it is recorded with the manager_approved label.
- Add an event to the GitLab Production calendar. The change scheduler can help with this.
Tasks an Hour Before Executing
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, then await their acknowledgment.
- Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, then wait for approval, confirmed by the eoc_approved label.
  - Example message:
@release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be adding Patroni and PGBouncer replicas for approximately 30 minutes. Kindly review and approve the CR
- Post an FYI link to the Slack message in the #test-platform channel.
Detailed steps for the change
Change Steps - steps to take to execute the change
Execution
- For gamedays, consider starting a recording of the process now.
- Note the start time in UTC in a comment to record this process's duration (a one-liner for a consistent timestamp is sketched below).
- Set the change::in-progress label: /label ~change::in-progress
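For a consistent timestamp, one option is plain coreutils `date`; the format here is an assumption matching this issue's scheduled-time field:

```sh
# Print the current UTC time in the YYYY-MM-DD HH:MM format used in this issue.
date -u +"%Y-%m-%d %H:%M"
```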
- Create the MR to provision temporary Patroni and PGBouncer replicas in a new (good) zone.
  - Create the MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/11553
  - Example MRs:
  - Notes:
    - `atlantis plan/apply` might fail if there are missing snapshots; a quick look at the runbook here might be helpful (a snapshot spot-check is also sketched after these notes).
    - When adding new instances, consider using 1xx numbering to help differentiate the new DR-created nodes. For example, if there are already five cluster members numbered 1..5, the new VMs would be 106, 107, etc.
    - For Patroni VMs, add more nodes to the module's `nodes` map.
    - For PGBouncer VMs, create a `nodes_overrides` map.
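If `atlantis plan` fails on missing snapshots, one way to check what snapshots exist is sketched below; the `sourceDisk` filter pattern is a guess at the disk naming convention, so adjust it to your cluster:

```sh
# List the most recent snapshots whose source disk matches the cluster name.
gcloud compute snapshots list --project=gitlab-staging-1 \
  --filter="sourceDisk~'patroni-main-v16'" \
  --sort-by=~creationTimestamp --limit=5
```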
- Get the MR approved.
- Apply the MR.
- For gamedays, note the time in UTC in a comment when you run `atlantis apply`; we can use the data to help calculate the VM Provision time.

❗ ❗ NOTE ❗ ❗ Merging the changes into master can take a while to complete; it can take up to 30 minutes before you are able to log in to the new replicas.
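While waiting, one way to confirm the new VMs have been provisioned (a sketch; the name filter assumes the 1xx numbering suggested above):

```sh
# List the freshly provisioned replicas and their status in the staging project.
gcloud compute instances list --project=gitlab-staging-1 \
  --filter="name~'patroni-main-v16-1[0-9][0-9]'" \
  --format="table(name,zone.basename(),status)"
```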
Validation of the Patroni replicas
- Wait for the instances to be built and for Chef to converge.
- Staging: Confirm that Chef runs have completed for the new Patroni replicas; it can take up to 30 minutes before they show up. A sudden increase on the y-axis of the completed-runs graph across the new replicas may signify that Chef has run successfully.
Troubleshooting tips
- Tail the serial output to confirm that the startup script executed successfully for all the new Patroni and PGBouncer nodes, and wait until there is no replication lag (a filled-in invocation is sketched below):

  ```sh
  gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
  ```

  The variable `$project` represents the project (`gitlab-staging-1` for the Patroni replicas), `$instance_name` represents the instance (e.g. `patroni-main-v16-106-db-gstg`), and `$zone` represents the recovery zone (e.g. `us-east1-b`).
- We could also tail the bootstrap logs, for example: `tail -f /var/tmp/bootstrap-20231108-133642.log`. If we see a log line stating `bootstrap completed`, we can move on to the next steps.
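Filled in with the example values above, the serial-output command looks like:

```sh
# Tail the serial console of one new replica until the startup script finishes.
project=gitlab-staging-1
instance_name=patroni-main-v16-106-db-gstg
zone=us-east1-b
gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
```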
- Once the Patroni replicas are ready, SSH into each replica and validate that Patroni is running: `sudo gitlab-patronictl list`
  - Restart Patroni if it is not: `sudo systemctl enable patroni && sudo systemctl start patroni`
  - A loop for spot-checking every new replica at once is sketched after this list.
- Review graphs for each of the database clusters to ensure traffic is being distributed to the new instances.
- Validate that error rates are nominal across the Rails services.
- If you are executing a gameday and will not leave these VMs provisioned, run the timing collection scripts now and refer to the wrapping-up section below.
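A minimal sketch for the spot-check across all new replicas (hostnames are examples consistent with the rest of this issue; substitute the nodes your MR created):

```sh
# Confirm the patroni service is active on each new replica before moving on.
for node in patroni-main-v16-{106,107}; do
  echo "== ${node}"
  ssh "${node}-db-gstg.c.gitlab-staging-1.internal" 'sudo systemctl is-active patroni'
done
```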
Switchover Patroni Leader
- Identify and note the current Leader for the `main` Patroni shard (we can omit `ci`, `registry`, and `embedding`, as the process is the same and repeating it gives no additional benefit for this gameday).
  - Run the `sudo gitlab-patronictl list` command on any patroni-main node. This provides the details about the cluster and the current Leader.
- Run the `sudo gitlab-patronictl switchover` command on any patroni-main node.
- Run the `sudo gitlab-patronictl list` command on any patroni-main node again and observe whether the Leader changed. (A sketch of the full flow follows these steps.)
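A minimal sketch of the flow, assuming a patronictl version that supports `--candidate` (confirm with `sudo gitlab-patronictl switchover --help`; the candidate member name here is hypothetical):

```sh
# On any patroni-main node: note the Leader, switch over, then verify.
sudo gitlab-patronictl list
# Run interactively and answer the prompts, or name a candidate explicitly:
sudo gitlab-patronictl switchover --candidate patroni-main-v16-106-db-gstg.c.gitlab-staging-1.internal main
sudo gitlab-patronictl list   # the Leader row should now show the candidate
```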
Validation of the Leader switchover
- PostgreSQL Replication overview
- Monitor which PGBouncer pool has connections
- Review WRITEs going to the cluster
- Review READs going to the cluster
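Beyond the dashboards, a direct spot-check from the current leader (a sketch assuming the `gitlab-psql` wrapper is available on these hosts, as on other GitLab database nodes; `pg_stat_replication` is standard PostgreSQL):

```sh
# One row per attached replica, with its replication state and replay lag.
sudo gitlab-psql -c "SELECT application_name, state, replay_lag FROM pg_stat_replication;"
```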
- Note the end time in UTC in a comment to record the completion of this change.
[For Gamedays only] Clean up
Patroni Leader switchover:
- Identify and note the current Leader for the `main` Patroni shard.
  - Run the `sudo gitlab-patronictl list` command on any patroni-main node.
- Run the `sudo gitlab-patronictl switchover` command on any patroni-main node and specify the pre-change Leader as the target.
  - Run the `sudo gitlab-patronictl list` command on any patroni-main node again and confirm the Leader changed back.
- Set all new Patroni nodes to maintenance mode. Adjust the replacement node names in the scripts below before execution. (A run_list verification is sketched after this step.)
  - Set the maintenance-mode roles:

    ```sh
    for node in patroni-ci-v16-{105,106} patroni-embedding-04 patroni-main-v16-{105,106} patroni-registry-v16-05; do knife node run_list add "${node}-db-gstg.c.gitlab-staging-1.internal" 'role[gstg-base-db-patroni-maintenance]' -y; done
    ```
  - Run chef-client:

    ```sh
    for node in patroni-ci-v16-{105,106} patroni-embedding-04 patroni-main-v16-{105,106} patroni-registry-v16-05; do echo "${node}-db-gstg.c.gitlab-staging-1.internal" ; done | xargs -P0 -I '{}' ssh {} 'sudo chef-client'
    ```

  ❗ ❗ NOTE ❗ ❗ The hostnames in these commands are examples. Be sure to update them to match the newly created nodes.
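To verify the maintenance role landed, one option (a sketch using an example hostname from above; `-r` limits knife's output to the node's run_list):

```sh
# Print only the node's run_list; the maintenance role should be present.
knife node show patroni-main-v16-105-db-gstg.c.gitlab-staging-1.internal -r
```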
- Open and merge an MR to remove the nodes added for this gameday: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/11554

  ❗ ❗ NOTE ❗ ❗ When removing the nodes created as part of this gameday, you may need to add the ~approve_policies label and comment `atlantis approve_policies` in the MR to bypass the policies before applying with `atlantis apply`.
Wrapping up
- Notify @release-managers and @sre-oncall that the exercise is complete.
- Set the change::complete label: /label ~change::complete
- Compile the real-time measurements for all the new Patroni and PGBouncer nodes by running the script, for example:

  ```sh
  # For new Patroni nodes
  for node in patroni-ci-v16-{105,106} patroni-embedding-04 patroni-main-v16-{105,106} patroni-registry-v16-05; do ./scripts/find-bootstrap-duration.sh $node-db-gstg.c.gitlab-staging-1.internal ; done

  # For PGBouncer nodes
  for node in pgbouncer-{07,08}-db-gstg pgbouncer-ci-04-db-gstg pgbouncer-sidekiq-04-db-gstg pgbouncer-sidekiq-ci-04-db-gstg; do ./scripts/find-bootstrap-duration.sh $node-db-gstg.c.gitlab-staging-1.internal ; done
  ```
- Update the Recovery Measurements Runbook.
Rollback
Rollback steps - steps to be taken in the event this change needs to be rolled back
It is estimated that this will take 5 minutes to complete.
- Notify @release-managers and @sre-oncall that the exercise is aborted.
- Set the change::aborted label: /label ~change::aborted
Monitoring
Key metrics to observe
- Completed Chef runs: Staging
- Patroni service: Staging
- PostgreSQL Overview: Staging
- Replication Lag: Staging
Change Reviewer checklist
- Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The change window has been agreed upon with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results are noted in a comment on this issue.
- A dry-run has been conducted and results are noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed before the change is rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents that are severity1 or severity2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.