[PRODUCTION] Change request to enable checksums on the production database Cluster

As requested and discussed in the issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11534

The goal is to enable checksums on the PostgreSQL database cluster.

The steps that we need to execute are:

High level summary of the change

Checksums will be enabled on postgres database patroni replica nodes, followed by a switchover from the leader to a checksum-enabled replica, finalized with enabling checksums on the last replica.

Ansible play to enable checksums on patroni replicas: https://ops.gitlab.net/gitlab-com/gl-infra/db-ops/-/blob/master/ansible/playbooks/pg_checksums_enable.yml

Change Details

  1. Services Impacted - Database
  2. Change Technician - Jose Finotto
  3. Change Criticality - C1
  4. Change Type - changeunscheduled
  5. Change Reviewer - Nels Nelson
  6. Due Date - TBD
  7. Time tracking - TBD
  8. Downtime Component - zero

Detailed steps for the change

For the current schedule we are considering the following status from the Patroni cluster:

+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
|     Cluster     |                      Member                     |      Host     |  Role  |  State  | TL | Lag in MB |
+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
| pg11-ha-cluster | patroni-01-db-gprd.c.gitlab-production.internal | 10.220.16.101 |        | running |  4 |       122 |
| pg11-ha-cluster | patroni-02-db-gprd.c.gitlab-production.internal | 10.220.16.102 |        | running |  4 |       233 |
| pg11-ha-cluster | patroni-03-db-gprd.c.gitlab-production.internal | 10.220.16.103 |        | running |  4 |       229 |
| pg11-ha-cluster | patroni-04-db-gprd.c.gitlab-production.internal | 10.220.16.104 |        | running |  4 |       141 |
| pg11-ha-cluster | patroni-05-db-gprd.c.gitlab-production.internal | 10.220.16.105 |        | running |  4 |       237 |
| pg11-ha-cluster | patroni-06-db-gprd.c.gitlab-production.internal | 10.220.16.106 | Leader | running |  4 |         0 |
| pg11-ha-cluster | patroni-07-db-gprd.c.gitlab-production.internal | 10.220.16.107 |        | running |  4 |        59 |
| pg11-ha-cluster | patroni-08-db-gprd.c.gitlab-production.internal | 10.220.16.108 |        | running |  4 |       222 |
+-----------------+-------------------------------------------------+---------------+--------+---------+----+-----------+

In case of any failover we need to re evaluate.

Pre-Change Steps - steps to be completed before the execution of the change:

  • Create the runbook to analyze and start the troubleshooting in case of data corruption.Issue related

  • Create the monitoring to alert in case of any data corruption. Issue related

  • Install in one node manually the binary, in a read-only host, I suggest Patroni-07. - 5 minutes

  • Execute manually a 'pg_checksums check' to verify the output and if the binary behaves as expected. - 5 minutes

  • Monitor for 48 hours this host, to verify we do not have any issue related to the package installed.

  • Install the package from the pg_checksums in all the hosts by chef.

  • execute this maintenance from a bastion host. We need to have a standard environment to execute the ansible.

Change Steps - steps to take to execute the change

  • Install the package by Chef in all the database nodes

Now we will execute sequentially the following steps:

  • Suggested date: 2020/12/03. Execute the playbook to enable checksums in 1 read-only node. We should start on the node Patroni-08, since does not receive traffic. We should disable the cron-jobs from the snapshot since they will be inconsistent during the time that we are enabling checksums. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replica that the checksums were enabled.

  • Suggested date: 2020/12/07. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-07. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2020/12/09. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-05. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2020/12/11. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-04. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2020/12/14. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-03. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2020/12/16. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-02. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2020/12/17. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-01. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2021/01/03. Create silences. Execute a switchover from the primary to a node that is a secondary with the checksums enabled. I would suggest executing this step at a lower production peak time during the weekend. The main reason here is to reduce the impact on the platform due to the switchover. Currently switchover from the node Patroni-06 to Patroni-01. Time expected: 5 minutes.

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

  • Suggested date: 2021/01/06. Create silences. Execute the playbook to enable checksums in 1 read-only node. We should continue on the node Patroni-06. Time expected to execute the step: 4 - 5 hours

  • Monitor for 48 hours any errors or performance degradation on the replicas that the checksums were enabled. Delete silences.

Rollback

In case of performance degradation on the primary when we switchover on a secondary that has the checksums enabled, and we do not manage to restore the performance, or find any possible improvement. We could execute a rollback:

  • Switch back to the old primary that does not have checksums enabled. Time expected: 5 minutes.

  • Disable the checksums in all the nodes, one by one. This process will take 30 min for each node approx. We have a playbook to execute the disable of the checksums. The playbook is the disable-checksums.yml

Monitoring

Key metrics to observe

We need to monitor the whole database cluster. In special the nodes that we enabled the checksums.

The metrics that we should focus on are:

  • CPU utilization.
  • CPU Load
  • Memory usage
  • I/O
  • Context switches
Edited by Nels Nelson