[PRODUCTION] Change request to enable checksums on the production database Cluster

As requested and discussed in the issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11534

The goal is to enable checksums on the PostgreSQL database cluster.

The steps that we need to execute are:

High level summary of the change

Checksums will be enabled on postgres database patroni replica nodes, followed by a switchover from the leader to a checksum-enabled replica, finalized with enabling checksums on the last replica.

Ansible play to enable checksums on patroni replicas: https://ops.gitlab.net/gitlab-com/gl-infra/db-ops/-/blob/master/ansible/playbooks/pg_checksums_enable.yml

Change Details

Services Impacted - Database
Change Technician - Jose Finotto
Change Criticality - C1
Change Type - changeunscheduled
Change Reviewer - Nels Nelson
Due Date - TBD
Time tracking - TBD
Downtime Component - zero

Detailed steps for the change

For the current schedule we are considering the following status from the Patroni cluster:

+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
|     Cluster     |                      Member                     |      Host     |  Role  |  State  | TL | Lag in MB |
+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
| pg11-ha-cluster | patroni-01-db-gprd.c.gitlab-production.internal | 10.220.16.101 |        | running |  4 |       122 |
| pg11-ha-cluster | patroni-02-db-gprd.c.gitlab-production.internal | 10.220.16.102 |        | running |  4 |       233 |
| pg11-ha-cluster | patroni-03-db-gprd.c.gitlab-production.internal | 10.220.16.103 |        | running |  4 |       229 |
| pg11-ha-cluster | patroni-04-db-gprd.c.gitlab-production.internal | 10.220.16.104 |        | running |  4 |       141 |
| pg11-ha-cluster | patroni-05-db-gprd.c.gitlab-production.internal | 10.220.16.105 |        | running |  4 |       237 |
| pg11-ha-cluster | patroni-06-db-gprd.c.gitlab-production.internal | 10.220.16.106 | Leader | running |  4 |         0 |
| pg11-ha-cluster | patroni-07-db-gprd.c.gitlab-production.internal | 10.220.16.107 |        | running |  4 |        59 |
| pg11-ha-cluster | patroni-08-db-gprd.c.gitlab-production.internal | 10.220.16.108 |        | running |  4 |       222 |
+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+

In case of any failover we need to re evaluate.

Pre-Change Steps - steps to be completed before the execution of the change:

Create the runbook to analyze and start the troubleshooting in case of data corruption.Issue related
Create the monitoring to alert in case of any data corruption. Issue related
Install in one node manually the binary, in a read-only host, I suggest Patroni-07. - 5 minutes
Execute manually a 'pg_checksums check' to verify the output and if the binary behaves as expected. - 5 minutes
Monitor for 48 hours this host, to verify we do not have any issue related to the package installed.
Install the package from the pg_checksums in all the hosts by chef.
execute this maintenance from a bastion host. We need to have a standard environment to execute the ansible.

Change Steps - steps to take to execute the change

Install the package by Chef in all the database nodes

Now we will execute sequentially the following steps:

Rollback

In case of performance degradation on the primary when we switchover on a secondary that has the checksums enabled, and we do not manage to restore the performance, or find any possible improvement. We could execute a rollback:

Switch back to the old primary that does not have checksums enabled. Time expected: 5 minutes.
Disable the checksums in all the nodes, one by one. This process will take 30 min for each node approx. We have a playbook to execute the disable of the checksums. The playbook is the disable-checksums.yml

Monitoring

Key metrics to observe

We need to monitor the whole database cluster. In special the nodes that we enabled the checksums.

The metrics that we should focus on are:

CPU utilization.
CPU Load
Memory usage
I/O
Context switches

Edited Dec 03, 2020 by Nels Nelson