[GPRD] [2024-11-01 to 11-05] - Upgrade PostgreSQL to v16 on MAIN cluster
Change Summary
The Patroni MAIN cluster currently runs Postgres 14, an outdated engine version. During the maintenance we will upgrade a new standby cluster to Postgres 16 and update the Consul service registration to point GitLab.com applications to the new Patroni cluster running PG 16 (a verification sketch follows the key benefits below).
Key Benefits:
- An up-to-date PostgreSQL engine provides increased security and stability.
- Performance improvements in Postgres 16 will significantly improve our ability to operate the primary database at our current scale.
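For illustration only, a minimal sketch of how one might verify which engine version the Consul-registered writer endpoint is serving after the registration switch; the Consul service name, database name, and user below are placeholders, not the production values.

```python
# Minimal verification sketch, not part of the change automation. The Consul service
# name, database, and user below are placeholders / assumptions.
import psycopg2
import requests

CONSUL = "http://127.0.0.1:8500"          # local Consul agent (assumption)
SERVICE = "patroni-main-primary"          # hypothetical service name for the MAIN writer

def resolve_primary():
    """Return (address, port) of the first healthy instance registered in Consul."""
    resp = requests.get(f"{CONSUL}/v1/health/service/{SERVICE}", params={"passing": "true"})
    resp.raise_for_status()
    svc = resp.json()[0]["Service"]
    return svc["Address"], svc["Port"]

def primary_version(host, port):
    """Ask the registered primary which engine version it is actually running."""
    conn = psycopg2.connect(host=host, port=port,
                            dbname="gitlabhq_production",      # placeholder
                            user="gitlab-monitoring")          # placeholder
    with conn, conn.cursor() as cur:
        cur.execute("SELECT current_setting('server_version')")
        return cur.fetchone()[0]

if __name__ == "__main__":
    host, port = resolve_primary()
    print(f"{SERVICE} -> {host}:{port} running PostgreSQL {primary_version(host, port)}")
```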
Upgrade events schedule
The CR is executed on the following schedule:
Friday 01/11 - Release Managers activity:
- 02:00 UTC - Run the last PDM (likely APAC-hours RMs)
- 04:00 - 14:00 UTC - Run as many of the remaining deployments as possible
Friday 01/11 23:00 UTC - PCL start:
- The upgrade PCL starts together with the standard weekend PCL
- Enable the "block DDL database migrations" feature flag
Saturday 02/11 06:00 UTC - Upgrade:
- Check that there are no running migrations (the upgrade might have to be delayed if any are still in progress)
- Perform the database upgrade in the target cluster (set up logical replication from the old v14 cluster to the new v16 cluster)
- Start amcheck in the v16 database (checks a sample of tables and indexes for consistency), which might take 12+ hours to run; the cutover might have to be delayed until it completes (see the sketch after this list)
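A rough sketch of what the amcheck step amounts to, assuming the amcheck extension can be created and using a placeholder DSN and sample size; the real run is driven by our upgrade tooling and covers far more data.

```python
# Rough sketch of the amcheck step, assuming the amcheck extension can be created and
# using a placeholder DSN and sample size; the real run is driven by the upgrade tooling.
import psycopg2

DSN = "host=patroni-v16-standby dbname=gitlabhq_production"   # placeholder
SAMPLE_SIZE = 100                                             # assumption

SAMPLE_INDEXES = """
    SELECT c.oid::regclass
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
    WHERE c.relkind = 'i' AND am.amname = 'btree'
    ORDER BY random()
    LIMIT %s
"""

def check_sample():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck")
        cur.execute(SAMPLE_INDEXES, (SAMPLE_SIZE,))
        for (index,) in cur.fetchall():
            # heapallindexed=true also verifies that every heap tuple has an index entry,
            # which is slower but a stronger consistency check.
            cur.execute("SELECT bt_index_check(%s::regclass, true)", (index,))
            print(f"OK  {index}")

if __name__ == "__main__":
    check_sample()
```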
Sunday 03/11 06:00 UTC - Cutover/Switchover:
- Workload switchover to v16 (v16 will become the active database)
  - The switchover should be unnoticeable for end-user customers, but there is a small risk of downtime if the automation fails
- Enable reverse replication (new v16 -> old v14; see the sketch after this list)
- Start the rollback window (monitor the workload for performance regressions on the new engine version)
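Conceptually, reverse replication is plain PostgreSQL logical replication pointing the other way. The sketch below is illustrative only; host names and the publication/subscription names are assumptions, and the actual step is performed by the upgrade automation.

```python
# Conceptual sketch of the reverse-replication step; host names, publication and
# subscription names are illustrative assumptions, the actual step is performed by
# the upgrade automation.
import psycopg2

V16_PRIMARY_DSN = "host=patroni-v16-primary dbname=gitlabhq_production"  # placeholder
V14_PRIMARY_DSN = "host=patroni-v14-primary dbname=gitlabhq_production"  # placeholder

def enable_reverse_replication():
    # 1) Publish all tables from the now-active v16 primary.
    v16 = psycopg2.connect(V16_PRIMARY_DSN)
    v16.autocommit = True
    with v16.cursor() as cur:
        cur.execute("CREATE PUBLICATION rollback_pub FOR ALL TABLES")
    v16.close()

    # 2) Subscribe the old v14 primary so it keeps receiving changes and stays a
    #    viable rollback target during the rollback window. copy_data=false because
    #    both sides are identical at switchover time.
    v14 = psycopg2.connect(V14_PRIMARY_DSN)
    v14.autocommit = True            # CREATE SUBSCRIPTION cannot run inside a transaction
    with v14.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION rollback_sub "
            "CONNECTION 'host=patroni-v16-primary dbname=gitlabhq_production' "
            "PUBLICATION rollback_pub "
            "WITH (copy_data = false)"
        )
    v14.close()

if __name__ == "__main__":
    enable_reverse_replication()
```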
Tuesday 05/11 11:00 UTC - PCL finish:
- End the rollback window
- End the operational lock
- Deployments will resume
Wednesday 06/11 09:00 UTC - Run the first PDM:
- To have enough packages available in case of problems
- Normal deployment cadence resumes
Downtime Requirements:
The PG 16 upgrade is an online process with near-zero downtime for end users. The workload will be enqueued during the switchover, which should take a few seconds, but there is a risk of up to 5 minutes of downtime if manual intervention is required for the Writer endpoint switchover.
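As an illustration of what "enqueued for a few seconds" looks like from the client side, a hypothetical probe could poll the writer endpoint and report when it becomes writable on v16; the endpoint DSN below is a placeholder, not the production connection string.

```python
# Hypothetical client-side probe (not the CR automation): poll the writer endpoint during
# the switchover to see how long writes are paused and when v16 becomes active. The
# endpoint DSN below is a placeholder.
import time
import psycopg2

WRITER_DSN = "host=master.patroni.service.consul dbname=gitlabhq_production connect_timeout=2"  # placeholder

def probe_writer(interval=1.0):
    """Print the writer state once per interval; stop with Ctrl-C."""
    while True:
        try:
            with psycopg2.connect(WRITER_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery(), current_setting('server_version')")
                in_recovery, version = cur.fetchone()
                state = "read-only" if in_recovery else "writable"
                print(f"{time.strftime('%H:%M:%S')} {state} on PostgreSQL {version}")
        except psycopg2.OperationalError as exc:
            # During the cutover the endpoint may briefly refuse connections.
            print(f"{time.strftime('%H:%M:%S')} unreachable ({type(exc).__name__})")
        time.sleep(interval)

if __name__ == "__main__":
    probe_writer()
```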
Requirement of Hard PCL for the CR Rollback window
The database upgrade process uses PostgreSQL logical replication to provide near-zero downtime. Unfortunately, logical replication is incompatible with database migrations (model changes), therefore we require an operational lock (PCL) to block deployments to the production MAIN database during the whole period of this change.
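The block itself is enforced by the feature flag and the PCL; purely as a hedged illustration of what "DDL during the window" means at the database level, one could scan pg_stat_activity on the primary for schema-changing statements. The DSN below is a placeholder.

```python
# The block itself is enforced by the feature flag and the PCL; this is only a hedged
# illustration of how DDL in flight could be spotted on the primary. The DSN is a placeholder.
import psycopg2

DSN = "host=patroni-main-primary dbname=gitlabhq_production"   # placeholder
DDL_PREFIXES = ("CREATE ", "ALTER ", "DROP ", "REINDEX ")

def active_ddl():
    """Return active backends whose current statement looks like DDL."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pid, usename, left(query, 120)
            FROM pg_stat_activity
            WHERE state = 'active' AND pid <> pg_backend_pid()
        """)
        return [row for row in cur.fetchall()
                if row[2] and row[2].lstrip().upper().startswith(DDL_PREFIXES)]

if __name__ == "__main__":
    for pid, user, query in active_ddl():
        print(f"DDL in flight: pid={pid} user={user} query={query!r}")
```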
In the context of the database upgrade CR, a rollback window is paramount, as it allows us to quickly revert the CR if the new database engine version causes any performance regression. From a database reliability engineering perspective, the upgrade can also be seen as one very large atomic deployment that touches every Ruby ActiveRecord CRUD operation in GitLab's code, because every SQL execution plan in the database might be positively or negatively affected by the new engine version.
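One hedged way to watch for such plan regressions during the rollback window, assuming pg_stat_statements is installed and was reset at switchover so the numbers reflect only the v16 workload, is to list the slowest statements on v16 and compare them against a v14 baseline captured before the cutover. The DSN is a placeholder.

```python
# Illustrative regression check, assuming pg_stat_statements is installed and was reset at
# switchover so the numbers reflect only the v16 workload; the DSN is a placeholder. The
# output would be compared against a v14 baseline captured before the cutover.
import psycopg2

DSN = "host=patroni-v16-primary dbname=gitlabhq_production"   # placeholder

TOP_STATEMENTS = """
    SELECT queryid, calls, round(mean_exec_time::numeric, 2) AS mean_ms, left(query, 100)
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 20
"""

def top_statements():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(TOP_STATEMENTS)
        return cur.fetchall()

if __name__ == "__main__":
    for queryid, calls, mean_ms, query in top_statements():
        print(f"queryid={queryid} calls={calls} mean={mean_ms}ms {query}")
```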
Compared with last year, we are increasing the PCL window to allow a 48 hour rollback window; note that the CR PCL overlaps with the standard weekend PCL window.
- Last year: 48 hour block pre-upgrade + standard weekend PCL (no rollback window)
- This year (proposed): standard weekend PCL (6 hour block pre-upgrade within the weekend PCL) + 33 hours rollback window PCL
Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
|  | @rhenchen.gitlab |
|  | @daveyleach |
|  | @rmar1 |
|  | @rhenchen.gitlab |
|  | @daveyleach |
|  | Self-Serve with escalations via PD |
|  | @nduff and @donnaalexandra (Switchover period), or ping @sre-oncall |
|  | @samihiltunen (Switchover period), or ping @imoc |
|  | Check the CMOC escalation table |
|  | TBD |
|  | @rmar1 @alexives @nhxnguyen |
|  | @mbursi @dawsmith |
|  | @marin |
📣 CMOC Escalation Table
Important: this table is only for when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To |
|---|---|
| Friday 01/11 23:00 UTC - PCL start | @tristan |
| Saturday 02/11 06:00 UTC - Upgrade | @rotanak |
| Sunday 03/11 06:00 UTC - Cutover/Switchover | @rotanak |
| Tuesday 05/11 11:00 UTC - PCL finish | @tmarsh1 |
Change Details
- Services Impacted - Service::Patroni, Service::Postgres Database
- Change Technician - @rhenchen.gitlab @daveyleach
- Change Reviewer - @vitabaks @bshah11 @alexander-sosna
- Time tracking - 3 days
- Downtime Component - Near Zero Customer Downtime (Database migrations need to be blocked)
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
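The runbook is the authoritative procedure; as a sketch only, maintenance mode can be toggled through the GitLab application settings API. The base URL and token below are placeholders.

```python
# Sketch only; the runbook is the authoritative procedure. GitLab maintenance mode can be
# toggled through the application settings API; the base URL and token are placeholders.
import requests

GITLAB_API = "https://gitlab.example.com/api/v4"   # placeholder
HEADERS = {"PRIVATE-TOKEN": "glpat-REDACTED"}      # admin token, placeholder

def set_maintenance_mode(enabled, message=None):
    payload = {"maintenance_mode": "true" if enabled else "false"}
    if message:
        payload["maintenance_mode_message"] = message
    resp = requests.put(f"{GITLAB_API}/application/settings", headers=HEADERS, data=payload)
    resp.raise_for_status()
    return resp.json().get("maintenance_mode")

if __name__ == "__main__":
    set_maintenance_mode(True, "Scheduled database maintenance in progress")
    # ... perform the change ...
    set_maintenance_mode(False)
```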
Detailed steps for the change
We assume GitLab.com could be unavailable during the execution of the CR, so the detailed instructions are kept on ops.gitlab.net in the following issue: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/72
Rollback Scenarios in case of Incident
The Upgrade CR can be aborted at any time in case of incidents, but we only need to abort the CR if the fix requires a DDL (database migration); otherwise there is no impact on the upgrade process.
Note: "a DDL (database migration)" means a fix that requires a schema change through a DDL statement, such as the CREATE, ALTER, DROP, or REINDEX SQL commands.
For any incident, a DBRE needs to evaluate the fix to be pushed before deciding what to do with the Upgrade CR.
We have a DBRE on-call rotation for the upgrade PCL window to assist with evaluating and aborting the CR if necessary during incidents.
There are three approaches that might be taken in case of an incident that requires us to abort the CR:
- If the incident is before the Switchover, we'll just abort the CR and the database will remain on v14;
- If the incident is after the Switchover, we can either:
  - Bring forward the end of the rollback window but keep the database on v16;
  - Or roll back the upgrade and return the database to v14 (see the pre-rollback check sketch after this list).
In any case, deployments will resume.
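Before choosing the full rollback to v14, we would confirm that the reverse (v16 -> v14) replication has caught up. A hypothetical check follows; the slot name and DSN are assumptions matching the illustrative names used earlier in this issue.

```python
# Hypothetical pre-rollback check: confirm the reverse (v16 -> v14) logical replication has
# caught up before switching back. Slot name and DSN are assumptions matching the
# illustrative names used earlier in this issue.
import psycopg2

V16_PRIMARY_DSN = "host=patroni-v16-primary dbname=gitlabhq_production"   # placeholder

LAG_QUERY = """
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS pending
    FROM pg_replication_slots
    WHERE slot_name = %s AND slot_type = 'logical'
"""

def reverse_replication_lag(slot_name="rollback_sub"):
    with psycopg2.connect(V16_PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY, (slot_name,))
        row = cur.fetchone()
        if row is None:
            raise RuntimeError(f"logical slot {slot_name!r} not found on the v16 primary")
        return row

if __name__ == "__main__":
    slot, pending = reverse_replication_lag()
    print(f"{slot}: {pending} of WAL not yet confirmed by the v14 subscriber")
```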
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.