2022-02-23 - PostgreSQL Minor Upgrade - Production

Production Change

Change Summary

Upgrade the Patroni-Main and Patroni-CI clusters to PostgreSQL 12.9 in the gprd environment.

We will execute the upgrade starting with all Patroni-CI and Patroni-Main replicas, and finish with the Patroni-Main leader, which will require application downtime while its PostgreSQL instance restarts.

Related Epic: &625 (closed)
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @mchacon3
- Change Reviewer - @mchacon3
- Time tracking - 90 minutes
- Downtime Component - YES
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 min
- Set label change::in-progress on this issue
- Create an Alert Silence with the following matchers:
  - env="gprd"
  - type="patroni"
  - Duration: 90 min
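  If the silence is created from a terminal instead of the Alertmanager UI, a roughly equivalent amtool invocation is sketched below (the Alertmanager URL is a placeholder, not taken from this plan):

  # Silence Patroni alerts in gprd for the 90-minute change window.
  amtool silence add env="gprd" type="patroni" \
    --duration="90m" \
    --author="@mchacon3" \
    --comment="2022-02-23 PostgreSQL 12.9 minor upgrade" \
    --alertmanager.url="http://alertmanager.example.internal:9093"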
- Log in to the gprd console server (console-01-sv-gprd.c.gitlab-production.internal) and open a screen session:
  screen
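  Optionally, name the session so it is easy to find and reattach if the SSH connection drops (a convenience, not part of the original plan):

  screen -S pg-minor-upgrade    # start a named session
  screen -dr pg-minor-upgrade   # detach any stale attachment and reattach here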
- If needed, install Ansible on the console server inside a virtualenv:
  sudo apt install python3-virtualenv
  virtualenv myansible
  source myansible/bin/activate
  pip install ansible
  ansible --version
- Clone the db-migration repo on the console server:
  git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
- Check SSH access to the Patroni nodes from the console server:
  cd db-migration/pg-upgrade-minor
  ./bin/ansible-exec.sh -e gprd -p ping
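  Assuming bin/ansible-exec.sh wraps Ansible's ping module (an assumption; the script's internals are not shown in this plan), every host should report SUCCESS before you continue. A bare-Ansible equivalent would look roughly like this, with an illustrative inventory path:

  # Hypothetical equivalent of the wrapper's ping playbook.
  ansible -i inventory/gprd.yml all -m ping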
- Disable the Chef client on all Patroni nodes:
  ./bin/ansible-exec.sh -e gprd -p disable_chef
- Seek approval for https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1369 and merge it.
- Log in to patroni-v12-01-db-gprd.c.gitlab-production.internal and patroni-ci-01-db-gprd.c.gitlab-production.internal and document the current status of the clusters. Identify the leader and standby leader:
  sudo gitlab-patronictl list
  The Patroni-Main leader at the time of writing this CR was patroni-v12-05-db-gprd.c.gitlab-production.internal.
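  If a scriptable answer is preferred over reading the table, Patroni's REST API can report the leader directly (a sketch, assuming the API listens on its default port 8008 on the node and jq is installed; neither is stated in this plan):

  # Print the name of the member currently holding the leader role.
  curl -s http://localhost:8008/cluster \
    | jq -r '.members[] | select(.role == "leader") | .name'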
- Keeping the previous sessions open, monitor the cluster status:
  watch -n 1 sudo gitlab-patronictl list
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 min
Patroni CI Cluster
- Open an SSH session to every cluster node you will be upgrading and monitor the PostgreSQL logs:
  sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
- Execute the Minor Upgrade Playbook for all patroni-ci nodes in the gprd environment:
  - patroni-ci-01-db-gprd.c.gitlab-production.internal
  - patroni-ci-02-db-gprd.c.gitlab-production.internal
  - patroni-ci-03-db-gprd.c.gitlab-production.internal
  - patroni-ci-04-db-gprd.c.gitlab-production.internal
  - patroni-ci-05-db-gprd.c.gitlab-production.internal
  - patroni-ci-06-db-gprd.c.gitlab-production.internal
  - patroni-ci-07-db-gprd.c.gitlab-production.internal
  - patroni-ci-08-db-gprd.c.gitlab-production.internal
  - patroni-ci-09-db-gprd.c.gitlab-production.internal
  - patroni-ci-10-db-gprd.c.gitlab-production.internal
- PATRONI_NODE=patroni-ci-10-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Repeat the above procedure for each Patroni CI node, updating PATRONI_NODE accordingly (see the loop sketched below).
- Document the output of each playbook execution on the change request ticket.
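  A minimal sketch of that repetition, assuming the playbook must run against one node at a time and that stopping at the first failure is desirable:

  # Upgrade the CI nodes sequentially; abort on the first failed run.
  for n in 01 02 03 04 05 06 07 08 09 10; do
    PATRONI_NODE="patroni-ci-${n}-db-gprd.c.gitlab-production.internal"
    ./bin/ansible-exec.sh -e gprd -h "$PATRONI_NODE" || break
  done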
- Enable the Chef client on the patroni-ci nodes:
  ./bin/ansible-exec.sh -e gprd -h patroni-ci -p enable_chef
Patroni Main Cluster Replicas
- Open an SSH session to all cluster nodes and monitor the PostgreSQL logs:
  sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
  The following steps assume patroni-v12-05-db-gprd.c.gitlab-production.internal is the cluster leader. Please confirm this is still the case.
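  One quick way to confirm, run from any Patroni-Main node (the grep is just a convenience, not part of the original plan):

  sudo gitlab-patronictl list | grep -i leader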
- Execute the Minor Upgrade Playbook for all patroni-v12 replica nodes in the gprd environment:
  - patroni-v12-01-db-gprd.c.gitlab-production.internal
  - patroni-v12-02-db-gprd.c.gitlab-production.internal
  - patroni-v12-03-db-gprd.c.gitlab-production.internal
  - patroni-v12-04-db-gprd.c.gitlab-production.internal
  - patroni-v12-06-db-gprd.c.gitlab-production.internal
  - patroni-v12-07-db-gprd.c.gitlab-production.internal
  - patroni-v12-08-db-gprd.c.gitlab-production.internal
  - patroni-v12-09-db-gprd.c.gitlab-production.internal
  - patroni-v12-10-db-gprd.c.gitlab-production.internal
- PATRONI_NODE=patroni-v12-10-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Repeat the above procedure for each patroni-v12 replica node, updating PATRONI_NODE accordingly (the same loop pattern shown for the CI nodes applies, with the leader, node 05, excluded).
- Document the output of each playbook execution on the change request ticket.
Patroni Main Cluster Leader
The following steps will cause temporary downtime on GitLab.com for write-related operations.
- Execute the Minor Upgrade Playbook for the Patroni leader, patroni-v12-05-db-gprd.c.gitlab-production.internal:
  PATRONI_NODE=patroni-v12-05-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Document the output of the playbook execution on the change request ticket.
- Enable the Chef client on the patroni-v12 nodes:
  ./bin/ansible-exec.sh -e gprd -h patroni -p enable_chef
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 min
- Document the status of both clusters after the upgrade:
  sudo gitlab-patronictl list
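  To confirm the binaries actually moved to 12.9 and the leader is accepting writes, a check along these lines can help (a sketch, assuming gitlab-psql is available on the Patroni nodes, which this plan does not state):

  sudo gitlab-psql -c "SELECT version();"            # expect: PostgreSQL 12.9 ...
  sudo gitlab-psql -c "SELECT pg_is_in_recovery();"  # expect: f on the leader, t on replicas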
- Expire the Alert Silence created earlier.
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 45 min
- Disable the Chef client on all nodes that need to be reverted:
  ./bin/ansible-exec.sh -e gprd -h patroni -p disable_chef
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1369
- Execute the Rollback playbook on each node group:
  ./bin/ansible-exec.sh -e gprd -h patroni-ci -p rollback
  ./bin/ansible-exec.sh -e gprd -h patroni -p rollback
- Enable the Chef client:
  ./bin/ansible-exec.sh -e gprd -p enable_chef
Monitoring
Key metrics to observe
- Metrics: Patroni Service Error Ratio, Patroni Service Apdex
- Location: Patroni Overview dashboard
- What changes to these metrics should prompt a rollback: a sustained error-ratio increase, or an Apdex drop lasting more than 5 minutes.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
- Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.