Intentional Switchover of DB Primary
Production Change
Change Summary
Perform a planned switchover of the Patroni primary database from patroni-v12-01-db-gprd to patroni-v12-02-db-gprd.
Change Details
- Services Impacted - SSH, WEB, API (GitLab.com)
- Change Technician - @ahmadsherif
- Change Criticality - C1
- Change Type - change::unscheduled
- Change Reviewer - @Finotto
- Due Date - 2021-05-08 ~14:45 UTC
- Time tracking - 30 mins
- Downtime Component - 2 min
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- gitlab-patronictl list - verify the status of the cluster
- gitlab-patronictl switchover --master patroni-v12-01-db-gprd.c.gitlab-production.internal --candidate patroni-v12-02-db-gprd.c.gitlab-production.internal - execute the switchover
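As a guard before running the switchover, the leader check can be scripted. The following is a minimal sketch, not part of the approved steps: the function names current_leader and plan_switchover are hypothetical, and it assumes gitlab-patronictl list prints the usual Patroni table with "Member" and "Role" header columns and marks the primary with the role "Leader". It only prints the command; nothing is executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a `list` table on stdin and print the member
# whose Role column reads "Leader". Column positions are taken from the
# header row, so an extra leading "Cluster" column does not matter.
current_leader() {
  awk -F'|' '
    # Header row: remember which fields hold the Member and Role columns.
    !m && /Member/ && /Role/ {
      for (i = 1; i <= NF; i++) {
        col = $i; gsub(/^ +| +$/, "", col)
        if (col == "Member") m = i
        if (col == "Role")   r = i
      }
      next
    }
    # Data rows: print the first member whose role is Leader, then stop.
    m && r {
      role = $r; gsub(/^ +| +$/, "", role)
      if (role == "Leader") {
        name = $m; gsub(/^ +| +$/, "", name)
        print name
        exit
      }
    }
  '
}

# Hypothetical guard: print the switchover command only if the current
# leader (read from stdin) matches the expected one; otherwise fail.
plan_switchover() {
  local expected_leader=$1 candidate=$2 leader
  leader=$(current_leader)
  if [ "$leader" != "$expected_leader" ]; then
    echo "unexpected leader: ${leader:-none}" >&2
    return 1
  fi
  echo "gitlab-patronictl switchover --master $expected_leader --candidate $candidate"
}

# Usage (assumes the wrapper prints the standard table):
#   gitlab-patronictl list | plan_switchover \
#     patroni-v12-01-db-gprd.c.gitlab-production.internal \
#     patroni-v12-02-db-gprd.c.gitlab-production.internal
```

Printing the command instead of running it keeps the operator in control of the actual switchover.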
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- gitlab-patronictl list - verify the status of the cluster
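The post-change check can also be scripted as a rough health scan of the same table. This is a sketch under assumptions: unhealthy_members is a hypothetical name, the table is assumed to have "Member" and "State" header columns, and "running" and "streaming" are assumed to be the only healthy states (which states appear depends on the Patroni version).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a `list` table on stdin and print every member
# whose State is neither "running" nor "streaming" (assumed healthy states).
unhealthy_members() {
  awk -F'|' '
    # Header row: remember which fields hold the Member and State columns.
    !m && /Member/ && /State/ {
      for (i = 1; i <= NF; i++) {
        col = $i; gsub(/^ +| +$/, "", col)
        if (col == "Member") m = i
        if (col == "State")  s = i
      }
      next
    }
    # Data rows only (border lines contain no "|" and have a single field).
    m && s && NF > s {
      name = $m; state = $s
      gsub(/^ +| +$/, "", name); gsub(/^ +| +$/, "", state)
      if (name != "" && state != "running" && state != "streaming")
        print name ": " state
    }
  '
}

# Usage: no output means every member reported an assumed-healthy state.
#   gitlab-patronictl list | unhealthy_members
```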
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- gitlab-patronictl list - verify the status of the cluster
- gitlab-patronictl switchover --master patroni-v12-02-db-gprd.c.gitlab-production.internal --candidate patroni-v12-01-db-gprd.c.gitlab-production.internal - execute the switchover back to the original primary
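The rollback is the forward switchover with the two hosts swapped. A minimal sketch that derives both commands from one pair of hostnames, so the rollback cannot drift from the forward step (switchover_cmd and rollback_cmd are hypothetical names; the commands are printed, not executed):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command that moves the leader from $1 to $2.
switchover_cmd() {
  printf 'gitlab-patronictl switchover --master %s --candidate %s\n' "$1" "$2"
}

# The rollback for a switchover from $1 to $2 is the same command
# with the two arguments swapped.
rollback_cmd() {
  switchover_cmd "$2" "$1"
}

# Usage:
#   rollback_cmd patroni-v12-01-db-gprd.c.gitlab-production.internal \
#                patroni-v12-02-db-gprd.c.gitlab-production.internal
```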
Monitoring
Key metrics to observe
- Metric: Postgresql Overview dashboard
- Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1
- What changes to this metric should prompt a rollback: No traffic is seen on patroni-v12-02 after the switchover.
- Metric: Patroni replication overview dashboard
- Location: https://dashboards.gitlab.net/d/000000244/postgresql-replication-overview?orgId=1&from=now-3h&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni
- What changes to this metric should prompt a rollback: The switchover did not take effect, i.e. the leader of the cluster is still patroni-v12-01.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.