[GPRD] [2025-07-25 to 2025-07-29] - Upgrade PostgreSQL to PG17 on [CI, Registry] Cluster

Change Summary

The Patroni [Main, CI, Registry, Sec] clusters currently run on Postgres 16. During the maintenance, we will upgrade a new standby cluster to Postgres 17 and update the Consul service registration to point GitLab.com applications to the corresponding new Patroni cluster running PG17. In addition to the PostgreSQL upgrade, we will upgrade the operating system from Ubuntu 20.04 LTS, which reaches end of standard support (EOSS) in 2025-05, to Ubuntu 22.04 LTS.
Parts of the procedure will therefore be similar to the previous iterations, with the addition of OS upgrade steps:
- [GPRD] [2024-11-01 to 11-05] - Upgrade PostgreSQL to v16 on MAIN cluster
- [GPRD] [2024-10-26 to 10-29] - Upgrade PostgreSQL to v16 on CI cluster
- [GPRD] [2024 September 06 to 11] - Upgrade PostgreSQL to v16 on Registry cluster
Key Benefits:
- An up-to-date PostgreSQL engine provides increased security and stability.
- Performance improvements in Postgres 17 will improve our ability to operate the database at our current scale.
- Ubuntu 20.04 LTS will no longer receive security updates after 2025-05-01 and poses a security risk.
Upgrade events schedule
The schedule is based on the planning in the OS+DB Upgrade Schedule (internal).
Registry + CI
The CR is executed according to the following schedule.
All timings are approximate.
- Friday 2025-07-25 00:00 - Pre OS Upgrade Amcheck
  - On the v17 Writer node, run `amcheck_collatable_parallel` to get a list of corrupted indexes for the reindexing process (see the sketch after this schedule)
- Friday 2025-07-25 - Release Managers activity
  - Run the last PDM
  - Run the last deployments
- Friday 2025-07-25 23:00Z - PCL start
  - The upgrade PCL starts with the standard weekend PCL
  - Enable the block DDL database migrations feature flag (this might have to be delayed)
  - Check that there are no running migrations
  - Perform the database upgrade in the target cluster (set up logical replication from the old PG16 cluster to the new PG17 cluster)
  - Start amcheck in the PG17 database (checks a sample of tables and indexes for consistency), which might take 12+ hours to run
- Sunday 2025-07-27 06:00Z - Cutover/Switchover Registry
  - Workload switchover to PG17 (PG17 becomes the active database)
  - The switchover should be unnoticeable for end-user customers, but there is a small risk of downtime if the automation fails
  - Enable reverse replication (new PG17 -> old PG16)
  - Start the rollback window (monitor the workload for performance regressions on the new engine version)
- Sunday 2025-07-27 08:00Z - Cutover/Switchover CI
  - Same steps as the Registry switchover above
- Tuesday 2025-07-29 08:00Z - Complete change
  - End the rollback window
  - Shut down the old cluster
- Tuesday 2025-07-29 09:00Z - PCL finish
  - End the PCL
  - Deploys will resume
- Tuesday 2025-07-29 10:00Z - Run the first PDM
  - To have enough packages available in case of problems
  - Return to normal deployment cadence
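For reference, a minimal sketch of the kind of consistency check the amcheck steps rely on, using the standard PostgreSQL amcheck extension. The internal `amcheck_collatable_parallel` tooling is assumed to be a wrapper that parallelises checks like this and focuses on indexes with collatable key columns (the ones at risk when the OS upgrade changes glibc collation versions); that behaviour is an assumption, not taken from this CR.

```sql
-- Minimal sketch only: the internal amcheck_collatable_parallel wrapper is
-- assumed to run checks like this in parallel across collatable btree indexes.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- bt_index_check() raises an error for any corrupted index; indexes that fail
-- here are candidates for the reindexing process.
SELECT n.nspname AS schema_name,
       c.relname AS index_name,
       bt_index_check(index => c.oid, heapallindexed => true)
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_am am       ON am.oid = c.relam
WHERE am.amname = 'btree'
  AND i.indisvalid
  AND n.nspname NOT IN ('pg_catalog', 'pg_toast');
```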
Total number of hours between the start and finish of the PCL lock = 82 hours:
- Friday 23:00Z to Saturday 23:00Z = 24 hours
- Saturday 23:00Z to Sunday 23:00Z = 24 hours
- Sunday 23:00Z to Monday 23:00Z = 24 hours
- Monday 23:00Z to Tuesday 09:00Z = 10 hours
Downtime Requirements:
The PG17 upgrade process is an online process with effectively zero downtime for end-users. The workload will be queued during the switchover, which should take a few seconds, but there is a risk of disruption of up to 5 minutes if manual intervention is required for the Writer endpoint switchover.
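As a hypothetical sanity check (not quoted from the runbook), the writer endpoint can be verified immediately after the switchover: it should report a 17.x server version and show that the node is out of recovery, i.e. accepting writes.

```sql
-- Hypothetical post-switchover check against the writer endpoint:
-- expect a 17.x version string and in_recovery = false.
SELECT current_setting('server_version') AS server_version,
       pg_is_in_recovery()               AS in_recovery;
```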
Requirement of Hard PCL for the CR Rollback window
The database upgrade process uses PostgreSQL logical replication to provide effectively zero downtime. Unfortunately, logical replication is incompatible with database migrations (model changes), therefore we require an operational lock (PCL) to block deployments in the Production CI database during the whole period of this change.
In the context of the database upgrade CR, a rollback window is paramount to allow us to quickly revert the CR in case any performance regression is caused by the new database engine version. From a database reliability engineering perspective, the database upgrade impact can also be interpreted as a very large atomic deployment that could affect all Ruby "Active Record CRUD operations" within GitLab's code, as all respective SQL execution plans in the database might be positively or negatively affected by the new engine version.
We will go with last year's Standard Weekend PCL (a 6-hour pre-upgrade block within the weekend PCL) + a 33-hour Rollback Window PCL.
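For context, a simplified sketch of the logical replication link this relies on; the object names and connection string are illustrative assumptions, not the actual values used by the upgrade tooling.

```sql
-- On the old PG16 primary (publisher); publication name is illustrative.
CREATE PUBLICATION pg17_upgrade_pub FOR ALL TABLES;

-- On the new PG17 cluster (subscriber); connection string is illustrative.
CREATE SUBSCRIPTION pg17_upgrade_sub
    CONNECTION 'host=pg16-primary.example.internal dbname=gitlabhq_production user=replication'
    PUBLICATION pg17_upgrade_pub;

-- Logical replication only carries DML (INSERT/UPDATE/DELETE/TRUNCATE).
-- A deployment that runs DDL such as ALTER TABLE ... ADD COLUMN on the
-- publisher is not replicated, and subsequent row changes can fail to apply
-- on the subscriber; this is why deployments and migrations are blocked by
-- the PCL for the duration of the change.
```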
Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
| | @rhenchen.gitlab |
| | @bprescott_ |
| | @rmar1 |
| | @ TBD |
| | Self-Serve with escalations via PD |
| | TBD |
| | TBD |
| | Check CMOC escalation table |
| | TBD |
| | @rmar1 |
| | mbursi dawsmith |
| | TBD |
📣 CMOC Escalation Table
Important: just for when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To |
|---|---|
| Friday 2025-07-25 23:00Z - PCL start | TBD |
| Saturday 2025-07-26 05:00Z - Upgrade | TBD |
| Sunday 2025-07-27 06:00Z - Cutover/Switchover | TBD |
| Tuesday 2025-07-29 08:00Z - PCL finish | TBD |
Change Details
- Services Impacted - ServicePatroniCI, ServicePatroniRegistry, Database
- Change Technician - @rhenchen.gitlab, @bprescott_
- Change Reviewer - @rhenchen.gitlab, @alexander-sosna
- Time tracking - 4 days
- Downtime Component - Near Zero Customer Downtime (Database migrations need to be blocked)
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-25 23:00
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
Detailed steps for the change
We will assume that GitLab.com could be unavailable during the execution of the CR, so we will use ops.gitlab.net for the detailed instructions. We will use the following issues:
- ServicePatroniRegistry: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/82
- ServicePatroniCI: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/83
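The step-by-step instructions live in those issues. As one illustrative example (an assumption, not quoted from those runbooks), a pre-cutover check on the PG16 publisher could confirm that the logical replication link has caught up before the switchover:

```sql
-- Hypothetical pre-cutover check on the PG16 publisher: replay lag for the
-- logical walsender feeding the PG17 cluster should be close to zero bytes.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```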
Rollback Scenarios in case of Incident
The Upgrade CR can be aborted at any time in case of incidents, but we only need to abort the CR if the fix will perform a DDL (database migration); otherwise there is no impact on the upgrade process.
Note: "a DDL (database migration)" means a fix that requires a schema change through a DDL statement, such as the CREATE, ALTER, DROP or REINDEX SQL commands.
For any incident, a DBRE needs to evaluate the fix to be pushed before deciding what to do with the Upgrade CR.
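As a hypothetical aid for that evaluation (not part of the official runbook), in-flight DDL on the cluster could be spotted with a query along these lines:

```sql
-- Hypothetical helper: list active sessions currently executing DDL, which
-- would conflict with the logical-replication-based upgrade.
SELECT pid,
       usename,
       query_start,
       left(query, 120) AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
  AND query ~* '^\s*(create|alter|drop|reindex)\s';
```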
TODO: We have a DBRE on-call rotation for the upgrade PCL window to assist with evaluating and aborting the CR if necessary during incidents.
There are three approaches we might take in case of an incident that requires us to abort the CR:
- If the incident happens before the switchover, we will simply abort the CR and the database will remain on PG16;
- If the incident happens after the switchover, we can either:
  - Bring forward the end of the rollback window but keep the database on PG17; or
  - Roll back the upgrade and return the database to PG16.
In any case, deployments will resume.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.