2022-02-23 - PostgreSQL Minor Upgrade - Production

Production Change

Change Summary

Upgrade the Patroni-Main and Patroni-CI clusters to PostgreSQL 12.9 in the gprd environment.

We will execute the upgrade starting with all Patroni-CI and Patroni-Main replicas, and finish with the Patroni-Main leader, which will require application downtime while its PostgreSQL instance restarts.

Related Epic: &625 (closed)
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @mchacon3
- Change Reviewer - @mchacon3
- Time tracking - 90 minutes
- Downtime Component - YES
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 min
- Set label change::in-progress on this issue
- Create an Alert Silence with the following matchers:
  - env="gprd"
  - type="patroni"
  - Duration: 90 min
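  If the silence is created from a terminal instead of the Alertmanager UI, a roughly equivalent amtool invocation is sketched below (the Alertmanager URL is a placeholder, not taken from this plan):

  # Silence Patroni alerts in gprd for the 90-minute change window.
  amtool silence add env="gprd" type="patroni" \
    --duration="90m" \
    --author="@mchacon3" \
    --comment="2022-02-23 PostgreSQL 12.9 minor upgrade" \
    --alertmanager.url="http://alertmanager.example.internal:9093"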
- Log in to the gprd console server (console-01-sv-gprd.c.gitlab-production.internal) and open a screen session:
  screen
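  Optionally, name the session so it is easy to find and reattach if the SSH connection drops (a convenience, not part of the original plan):

  screen -S pg-minor-upgrade    # start a named session
  screen -dr pg-minor-upgrade   # detach any stale attachment and reattach here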
- If needed, install Ansible on the console server inside a virtualenv:
  sudo apt install python3-virtualenv
  virtualenv myansible
  source myansible/bin/activate
  pip install ansible
  ansible --version
- Clone the db-migration repo on the console server:
  git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
- Check SSH access to the Patroni nodes from the console server:
  cd db-migration/pg-upgrade-minor
  ./bin/ansible-exec.sh -e gprd -p ping
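  Assuming bin/ansible-exec.sh wraps Ansible's ping module (an assumption; the script's internals are not shown in this plan), every host should report SUCCESS before you continue. A bare-Ansible equivalent would look roughly like this, with an illustrative inventory path:

  # Hypothetical equivalent of the wrapper's ping playbook.
  ansible -i inventory/gprd.yml all -m ping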
- Disable the Chef client on all Patroni nodes:
  ./bin/ansible-exec.sh -e gprd -p disable_chef
- Seek approval for https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1369 and merge it.
- Log in to patroni-v12-01-db-gprd.c.gitlab-production.internal and patroni-ci-01-db-gprd.c.gitlab-production.internal and document the current status of the clusters. Identify the leader and standby leader:
  sudo gitlab-patronictl list
  The Patroni-Main leader at the time of writing this CR was patroni-v12-05-db-gprd.c.gitlab-production.internal.
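  If a scriptable answer is preferred over reading the table, Patroni's REST API can report the leader directly (a sketch, assuming the API listens on its default port 8008 on the node and jq is installed; neither is stated in this plan):

  # Print the name of the member currently holding the leader role.
  curl -s http://localhost:8008/cluster \
    | jq -r '.members[] | select(.role == "leader") | .name'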
- Keeping the previous sessions open, monitor the cluster status:
  watch -n 1 sudo gitlab-patronictl list
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 min
Patroni CI Cluster
- Open an SSH session to every cluster node you will be upgrading and monitor the PostgreSQL logs:
  sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
- Execute the Minor Upgrade Playbook for all patroni-ci nodes in the gprd environment:
  - patroni-ci-01-db-gprd.c.gitlab-production.internal
  - patroni-ci-02-db-gprd.c.gitlab-production.internal
  - patroni-ci-03-db-gprd.c.gitlab-production.internal
  - patroni-ci-04-db-gprd.c.gitlab-production.internal
  - patroni-ci-05-db-gprd.c.gitlab-production.internal
  - patroni-ci-06-db-gprd.c.gitlab-production.internal
  - patroni-ci-07-db-gprd.c.gitlab-production.internal
  - patroni-ci-08-db-gprd.c.gitlab-production.internal
  - patroni-ci-09-db-gprd.c.gitlab-production.internal
  - patroni-ci-10-db-gprd.c.gitlab-production.internal
- PATRONI_NODE=patroni-ci-10-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Repeat the above procedure for each Patroni CI node, updating PATRONI_NODE accordingly (see the loop sketched below).
- Document the output of each playbook execution on the change request ticket.
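  A minimal sketch of that repetition, assuming the playbook must run against one node at a time and that stopping at the first failure is desirable:

  # Upgrade the CI nodes sequentially; abort on the first failed run.
  for n in 01 02 03 04 05 06 07 08 09 10; do
    PATRONI_NODE="patroni-ci-${n}-db-gprd.c.gitlab-production.internal"
    ./bin/ansible-exec.sh -e gprd -h "$PATRONI_NODE" || break
  done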
- Enable the Chef client on the patroni-ci nodes:
  ./bin/ansible-exec.sh -e gprd -h patroni-ci -p enable_chef
Patroni Main Cluster Replicas
- Open an SSH session to all cluster nodes and monitor the PostgreSQL logs:
  sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
  The following steps assume patroni-v12-05-db-gprd.c.gitlab-production.internal is the cluster leader. Please confirm this is still the case.
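  One quick way to confirm, run from any Patroni-Main node (the grep is just a convenience, not part of the original plan):

  sudo gitlab-patronictl list | grep -i leader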
- Execute the Minor Upgrade Playbook for all patroni-v12 replica nodes in the gprd environment:
  - patroni-v12-01-db-gprd.c.gitlab-production.internal
  - patroni-v12-02-db-gprd.c.gitlab-production.internal
  - patroni-v12-03-db-gprd.c.gitlab-production.internal
  - patroni-v12-04-db-gprd.c.gitlab-production.internal
  - patroni-v12-06-db-gprd.c.gitlab-production.internal
  - patroni-v12-07-db-gprd.c.gitlab-production.internal
  - patroni-v12-08-db-gprd.c.gitlab-production.internal
  - patroni-v12-09-db-gprd.c.gitlab-production.internal
  - patroni-v12-10-db-gprd.c.gitlab-production.internal
- PATRONI_NODE=patroni-v12-10-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Repeat the above procedure for each patroni-v12 replica node, updating PATRONI_NODE accordingly (the same loop pattern shown for the CI nodes applies, with the leader, node 05, excluded).
- Document the output of each playbook execution on the change request ticket.
Patroni Main Cluster Leader
The following steps will cause temporary downtime on GitLab.com for write-related operations.
- Execute the Minor Upgrade Playbook for the Patroni leader, patroni-v12-05-db-gprd.c.gitlab-production.internal:
  PATRONI_NODE=patroni-v12-05-db-gprd.c.gitlab-production.internal
  ./bin/ansible-exec.sh -e gprd -h $PATRONI_NODE
- Document the output of the playbook execution on the change request ticket.
- Enable the Chef client on the patroni-v12 nodes:
  ./bin/ansible-exec.sh -e gprd -h patroni -p enable_chef
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 min
- Document the status of both clusters after the upgrade:
  sudo gitlab-patronictl list
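  To confirm the binaries actually moved to 12.9 and the leader is accepting writes, a check along these lines can help (a sketch, assuming gitlab-psql is available on the Patroni nodes, which this plan does not state):

  sudo gitlab-psql -c "SELECT version();"            # expect: PostgreSQL 12.9 ...
  sudo gitlab-psql -c "SELECT pg_is_in_recovery();"  # expect: f on the leader, t on replicas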
- Expire the Alert Silence created earlier.
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 45 min
- Disable the Chef client on all nodes that need to be reverted:
  ./bin/ansible-exec.sh -e gprd -h patroni -p disable_chef
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1369
- Execute the Rollback playbook on each node group:
  ./bin/ansible-exec.sh -e gprd -h patroni-ci -p rollback
  ./bin/ansible-exec.sh -e gprd -h patroni -p rollback
- Enable the Chef client:
  ./bin/ansible-exec.sh -e gprd -p enable_chef
Monitoring
Key metrics to observe
- Metrics: Patroni Service Error Ratio, Patroni Service Apdex
- Location: Patroni Overview dashboard
- What changes to these metrics should prompt a rollback: a sustained error-ratio increase, or an Apdex drop lasting more than 5 minutes.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
- Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.