
[GSTG] [2025-07-02 to 2025-07-03] - Upgrade PostgreSQL to PG17 on Registry

Cluster [Registry]

Change Summary

The Patroni [Main, CI, Registry, Sec] clusters currently run on Postgres 16. During the maintenance, we will upgrade a new standby cluster to Postgres 17 and update the Consul service registration to point GitLab.com applications to the corresponding new Patroni cluster running PG17. In addition to the PostgreSQL upgrade, we will upgrade the operating system from Ubuntu 20.04 LTS, which reaches end of standard support (EOSS) in 2025-05, to Ubuntu 24.04 LTS.

Parts of the procedure will therefore be similar to the previous iteration, but with the addition of OS upgrade steps.

Key Benefits:

  • Up-to-date PostgreSQL engine will provide increased security and stability.
  • Performance improvements in Postgres 17 will improve our ability to operate the database at our current scale
  • Ubuntu 20.04 LTS will not receive security updates after 2025-05-01 and poses a security risk

Upgrade events schedule

The schedule is based on the planning in OS+DB Upgrade Schedule (internal).

Registry

The CR is executed through the following schedule

NB: Timings are approximate and worst case scenario. Because there is substantially less data in STG, timings are expected to actually be less than the worst case outlined below.

  • Wednesday 2025-07-02 Container Registry Activity
    • Stop future deployments and all DDL
  • Wednesday 2025-07-02 06:00Z - Registry-only-PCL start:
  • Wednesday 2025-07-02 10:00Z - Upgrade:
    • this might have to be delayed if migrations are still running
      • check that there are no running migrations
    • perform the database upgrade in the target cluster (convert to logical replication, old PG16 -> new PG17)
    • start amcheck in the PG17 database (check a sample of tables and indexes for consistency; see the sketch after this schedule), which might take 12+ hours to run
  • Wednesday 2025-07-02 12:00Z - Cutover/Switchover:
    • Workload switchover to PG17 (PG17 will become the active database)
      • Switchover should be unnoticeable for end-user customers, but there's a small risk of downtime if the automation fails;
    • Enable reverse replication (new PG17 -> old PG16)
    • Start the rollback window (monitor workload for performance regression in the new engine version)
  • Thursday 2025-07-03 18:00Z (This day will only be needed if new tasks emerge during the execution) - PCL finish:
    • End rollback window
    • End operational lock
    • Deploys will resume
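For context on the amcheck step, the sketch below shows what a sampled B-tree consistency check could look like. The DSN and sample size are assumptions, not values from this CR; the actual run is driven by the upgrade tooling and covers a much larger sample, hence the 12+ hours.

```python
# Illustrative only: sample-based amcheck verification on the new PG17 cluster.
# The DSN and SAMPLE_SIZE are assumptions.
import psycopg2

DSN = "host=pg17.example.internal dbname=registry user=postgres"  # assumption
SAMPLE_SIZE = 50  # assumption

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck;")
    # Pick a random sample of B-tree indexes to verify.
    cur.execute(
        """
        SELECT c.oid::regclass
        FROM pg_index i
        JOIN pg_class c ON c.oid = i.indexrelid
        JOIN pg_am a ON a.oid = c.relam
        WHERE a.amname = 'btree'
        ORDER BY random()
        LIMIT %s;
        """,
        (SAMPLE_SIZE,),
    )
    for (index_name,) in cur.fetchall():
        # heapallindexed=true also cross-checks the heap against the index,
        # which is slower but catches missing index entries.
        cur.execute("SELECT bt_index_check(%s::regclass, true);", (index_name,))
        print(f"OK: {index_name}")
conn.close()
```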

Downtime Requirements:

The PG17 upgrade process will be an online process with effectively zero downtime for end-users. The workload will be queued during the switchover, which should take a few seconds, but there is a risk of disruption of up to 5 minutes if manual intervention is required for the Writer endpoint switchover. This will only affect the Container Registry.
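The Writer endpoint switchover is driven by the Consul service registration mentioned in the summary. For illustration only, a check like the following could confirm where the writer currently resolves; the agent address and service name are assumptions, not the actual GSTG values.

```python
# Illustrative only: check which node the Consul service registration for the
# writer currently resolves to. CONSUL and SERVICE are assumptions.
import requests

CONSUL = "http://127.0.0.1:8500"             # assumption: local Consul agent
SERVICE = "master-patroni-registry"          # assumption: writer service name

resp = requests.get(f"{CONSUL}/v1/health/service/{SERVICE}", params={"passing": "true"})
resp.raise_for_status()
for entry in resp.json():
    node = entry["Node"]["Node"]
    addr = entry["Service"]["Address"] or entry["Node"]["Address"]
    port = entry["Service"]["Port"]
    print(f"writer -> {node} ({addr}:{port})")
```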

Requirement of Registry-only-PCL for the CR Rollback window

The database upgrade process uses PostgreSQL logical replication to provide effectively zero downtime. Unfortunately, logical replication is incompatible with database migrations (schema changes), therefore we require an operational lock (Registry-only-PCL) to block deployments to the Registry database during the whole period of this change.
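For background, the mechanism is a publication on the old PG16 primary with a subscription on the new PG17 cluster, roughly as sketched below. Hostnames and object names are assumptions, and the real change is performed by the upgrade automation, not by this script.

```python
# Simplified sketch of the logical replication topology (old PG16 publishing,
# new PG17 subscribing). All connection details and names are assumptions.
import psycopg2

OLD_PG16 = "host=pg16.example.internal dbname=registry user=postgres"  # assumption
NEW_PG17 = "host=pg17.example.internal dbname=registry user=postgres"  # assumption

# On the old PG16 primary: publish all tables.
src = psycopg2.connect(OLD_PG16)
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")
src.close()

# On the new PG17 cluster: subscribe to the PG16 publication. DDL is not
# replicated by logical replication, which is why migrations must stay
# blocked (Registry-only-PCL) while this subscription is active.
dst = psycopg2.connect(NEW_PG17)
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
with dst.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION upgrade_sub "
        "CONNECTION 'host=pg16.example.internal dbname=registry user=replicator' "
        "PUBLICATION upgrade_pub;"
    )
dst.close()
```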

In the context of the database upgrade CR, a rollback window is paramount to allow us to quickly revert the CR in case any performance regression is caused by the new database engine version.

Postgres Upgrade Rollout Team

Role Assigned To
🐺 Coordinator @alexander-sosna
🔪 Playbook-Runner @bprescott_
☎️ Comms-Handler @rmar1
🐘 DBRE Not needed in GSTG
🐬 SRE Not needed in GSTG
🏆 Quality Self-Serve with escalations via PD
🚑 EOC @donnaalexandra
🚒 IMOC Not needed in GSTG
📣 CMOC Not needed in GSTG
🔦 Database Maintainers Not needed in GSTG
💾 Database Escalation @rmar1
🚚 Delivery Escalation @mbursi @dawsmith
🎩 Head Honcho Not needed in GSTG

Change Details

Time tracking

  1. Services Impacted - ServicePatroniRegistry ServicePostgres Database
  2. Change Technician - @alexander-sosna, @bprescott_
  3. Change Reviewer - @bshah11, @rhenchen.gitlab, @jjsisson
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-02 06:00
  5. Time tracking - 2 days (1 day active work)
  6. Downtime Component - Near Zero Customer Downtime (Database migrations need to be blocked)

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

Detailed steps for the change

Pre-execution steps

  • Make sure all tasks in Change Technician checklist are done
  • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • The SRE on-call provided approval with the eoc_approved label on the issue.
  • For C1, C2, or blocks deployments change issues, Release managers have been informed prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents that are severity1 or severity2
  • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
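As an illustration of that last point, a silence could be created via the Alertmanager API roughly as follows. The URL, label name, matcher value, and duration are assumptions; the silencing procedure in the runbooks takes precedence.

```python
# Illustrative only: creating an Alertmanager silence for the hosts under
# maintenance. ALERTMANAGER and HOST_REGEX are assumptions.
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "https://alertmanager.example.internal"  # assumption
HOST_REGEX = "patroni-registry-.*"                      # assumption: fqdn label regex

now = datetime.now(timezone.utc)
silence = {
    "matchers": [{"name": "fqdn", "value": HOST_REGEX, "isRegex": True}],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=36)).isoformat(),
    "createdBy": "change-technician",
    "comment": "PG17 / Ubuntu 24.04 upgrade maintenance window",
}
resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence)
resp.raise_for_status()
print("silence id:", resp.json()["silenceID"])
```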

Change steps - steps to take to execute the change

We will assume that GitLab.com may be unavailable during the execution of the CR, so we will use ops.gitlab.net for the detailed instructions. We will use the following issues:

Rollback Scenarios in case of Incident

The Upgrade CR can be aborted at any time in case of incidents, but we only need to abort the CR if the fix requires a DDL (database migration); otherwise there is no impact on the upgrade process.

Note: "a DDL (database migration)" means a fix that requires a schema change through a DDL statement, such as the CREATE, ALTER, DROP or REINDEX SQL commands.
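For illustration, in-flight DDL of this kind can be spotted in pg_stat_activity. The sketch below uses a hypothetical DSN and is not part of the change procedure; the DBRE's evaluation remains authoritative.

```python
# Illustrative only: a quick pg_stat_activity check for in-flight DDL.
# The DSN is an assumption.
import psycopg2

DSN = "host=pg16.example.internal dbname=registry user=postgres"  # assumption

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(
            r"""
            SELECT pid, usename, state, left(query, 120) AS query
            FROM pg_stat_activity
            WHERE state <> 'idle'
              AND query ~* '^\s*(create|alter|drop|reindex)\M';
            """
        )
        for pid, user, state, query in cur.fetchall():
            print(f"DDL in flight: pid={pid} user={user} state={state} query={query}")
```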

For any incident, a DBRE needs to evaluate the fix to be pushed before deciding what to do with the Upgrade CR.

For production, we have a DBRE on-call rotation for the upgrade Registry-only-PCL window to assist with evaluating and aborting the CR if necessary during incidents. For non-production environments like GSTG, we do not plan full on-call coverage; we make sure that at least one of the change technicians is available on Monday morning UTC.


There are three approaches we might take in case of an incident that requires us to abort the CR:

  • If the incident is before the Switchover, we'll just abort the CR and the database will remain on PG16;
  • If the incident is after the Switchover, we can:
    • End the rollback window early but keep the database on PG17;
    • Or roll back the upgrade and return the database to PG16 (see the sketch after this list);

In any case, deployments will resume.
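As referenced above, before any rollback to PG16 we would want the reverse replication (PG17 -> PG16) to be caught up. The sketch below shows one way to inspect subscription progress on the PG16 side, with a hypothetical DSN; thresholds and the final decision belong to the change team.

```python
# Illustrative only: confirm the reverse logical replication subscription on
# the old PG16 cluster is caught up before rolling back. The DSN is an assumption.
import psycopg2

OLD_PG16 = "host=pg16.example.internal dbname=registry user=postgres"  # assumption

with psycopg2.connect(OLD_PG16) as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT subname,
                   received_lsn,
                   latest_end_lsn,
                   now() - latest_end_time AS apply_delay
            FROM pg_stat_subscription;
            """
        )
        for subname, received_lsn, applied_lsn, apply_delay in cur.fetchall():
            print(f"{subname}: received={received_lsn} applied={applied_lsn} delay={apply_delay}")
```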

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • The change plan is technically accurate.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • The change execution window respects the Production Change Lock periods.
  • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
  • For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #production channel, mention @release-managers and this issue and await their acknowledgment.)