[GPRD] [2025-07-25 to 2025-07-29] - Upgrade PostgreSQL to PG17 on [CI, Registry] Cluster

Change Summary

The Patroni [Main, CI, Registry, Sec] clusters currently run on Postgres 16. During the maintenance, we will upgrade a new standby cluster to Postgres 17 and update the Consul service registration to point GitLab.com applications to the corresponding new Patroni cluster running PG17. In addition to the PostgreSQL upgrade, we will upgrade the operating system from Ubuntu 20.04 LTS, which reaches end of standard support (EOSS) in 2025-05, to Ubuntu 22.04 LTS.
Parts of the procedure will therefore be similar to the previous iterations, with the addition of OS upgrade steps:
- [GPRD] [2024-11-01 to 11-05] - Upgrade PostgreSQL to v16 on MAIN cluster
- [GPRD] [2024-10-26 to 10-29] - Upgrade PostgreSQL to v16 on CI cluster
- [GPRD] [2024 September 06 to 11] - Upgrade PostgreSQL to v16 on Registry cluster
Key Benefits:
- An up-to-date PostgreSQL engine provides increased security and stability.
- Performance improvements in Postgres 17 will improve our ability to operate the database at our current scale.
- Ubuntu 20.04 LTS will no longer receive security updates after 2025-05-01 and poses a security risk.
Upgrade events schedule
The schedule is based on the planning in the OS+DB Upgrade Schedule (internal).
Registry + CI
The CR is executed according to the following schedule.
All timings are approximate.
- Friday 2025-07-25 00:00 - Pre OS Upgrade Amcheck
  - On the v17 Writer node, run `amcheck_collatable_parallel` to get a list of corrupted indexes for the reindexing process (see the sketch after this schedule)
- Friday 2025-07-25 - Release Managers activity
  - Run the last PDM
  - Run the last deployments
- Friday 2025-07-25 23:00Z - PCL start
  - The upgrade PCL starts with the standard weekend PCL
  - Enable the block DDL database migrations feature flag (this might have to be delayed)
  - Check that there are no running migrations
  - Perform the database upgrade in the target cluster (set up logical replication from the old PG16 cluster to the new PG17 cluster)
  - Start amcheck in the PG17 database (checks a sample of tables and indexes for consistency), which might take 12+ hours to run
- Sunday 2025-07-27 06:00Z - Cutover/Switchover Registry
  - Workload switchover to PG17 (PG17 becomes the active database)
  - The switchover should be unnoticeable for end-user customers, but there is a small risk of downtime if the automation fails
  - Enable reverse replication (new PG17 -> old PG16)
  - Start the rollback window (monitor the workload for performance regressions on the new engine version)
- Sunday 2025-07-27 08:00Z - Cutover/Switchover CI
  - Same steps as the Registry switchover above
- Tuesday 2025-07-29 08:00Z - Complete change
  - End the rollback window
  - Shut down the old cluster
- Tuesday 2025-07-29 09:00Z - PCL finish
  - End the PCL
  - Deploys will resume
- Tuesday 2025-07-29 10:00Z - Run the first PDM
  - To have enough packages available in case of problems
  - Return to normal deployment cadence
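For reference, a minimal sketch of the kind of consistency check the amcheck steps rely on, using the standard PostgreSQL amcheck extension. The internal `amcheck_collatable_parallel` tooling is assumed to be a wrapper that parallelises checks like this and focuses on indexes with collatable key columns (the ones at risk when the OS upgrade changes glibc collation versions); that behaviour is an assumption, not taken from this CR.

```sql
-- Minimal sketch only: the internal amcheck_collatable_parallel wrapper is
-- assumed to run checks like this in parallel across collatable btree indexes.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- bt_index_check() raises an error for any corrupted index; indexes that fail
-- here are candidates for the reindexing process.
SELECT n.nspname AS schema_name,
       c.relname AS index_name,
       bt_index_check(index => c.oid, heapallindexed => true)
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_am am       ON am.oid = c.relam
WHERE am.amname = 'btree'
  AND i.indisvalid
  AND n.nspname NOT IN ('pg_catalog', 'pg_toast');
```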
Total number of hours between the start and finish of the PCL lock = 82 hours:
- Friday 23:00Z to Saturday 23:00Z = 24 hours
- Saturday 23:00Z to Sunday 23:00Z = 24 hours
- Sunday 23:00Z to Monday 23:00Z = 24 hours
- Monday 23:00Z to Tuesday 09:00Z = 10 hours
Downtime Requirements:
The PG17 upgrade process is an online process with effectively zero downtime for end-users. The workload will be queued during the switchover, which should take a few seconds, but there is a risk of disruption of up to 5 minutes if manual intervention is required for the Writer endpoint switchover.
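As a hypothetical sanity check (not quoted from the runbook), the writer endpoint can be verified immediately after the switchover: it should report a 17.x server version and show that the node is out of recovery, i.e. accepting writes.

```sql
-- Hypothetical post-switchover check against the writer endpoint:
-- expect a 17.x version string and in_recovery = false.
SELECT current_setting('server_version') AS server_version,
       pg_is_in_recovery()               AS in_recovery;
```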
Requirement of Hard PCL for the CR Rollback window
The database upgrade process uses PostgreSQL logical replication to provide effectively zero downtime. Unfortunately, logical replication is incompatible with database migrations (model changes), therefore we require an operational lock (PCL) to block deployments in the Production CI database during the whole period of this change.
In the context of the database upgrade CR, a rollback window is paramount to allow us to quickly revert the CR in case any performance regression is caused by the new database engine version. From a database reliability engineering perspective, the database upgrade impact can also be interpreted as a very large atomic deployment that could affect all Ruby "Active Record CRUD operations" within GitLab's code, as all respective SQL execution plans in the database might be positively or negatively affected by the new engine version.
We will go with last year's Standard Weekend PCL (a 6-hour pre-upgrade block within the weekend PCL) + a 33-hour Rollback Window PCL.
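For context, a simplified sketch of the logical replication link this relies on; the object names and connection string are illustrative assumptions, not the actual values used by the upgrade tooling.

```sql
-- On the old PG16 primary (publisher); publication name is illustrative.
CREATE PUBLICATION pg17_upgrade_pub FOR ALL TABLES;

-- On the new PG17 cluster (subscriber); connection string is illustrative.
CREATE SUBSCRIPTION pg17_upgrade_sub
    CONNECTION 'host=pg16-primary.example.internal dbname=gitlabhq_production user=replication'
    PUBLICATION pg17_upgrade_pub;

-- Logical replication only carries DML (INSERT/UPDATE/DELETE/TRUNCATE).
-- A deployment that runs DDL such as ALTER TABLE ... ADD COLUMN on the
-- publisher is not replicated, and subsequent row changes can fail to apply
-- on the subscriber; this is why deployments and migrations are blocked by
-- the PCL for the duration of the change.
```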
Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
| | @rhenchen.gitlab |
| | @bprescott_ |
| | @rmar1 |
| | @ TBD |
| | Self-Serve with escalations via PD |
| | TBD |
| | TBD |
| | Check CMOC escalation table |
| | TBD |
| | @rmar1 |
| | mbursi dawsmith |
| | TBD |
📣 CMOC Escalation Table
Important: just for when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To |
|---|---|
| Friday 2025-07-25 23:00Z - PCL start | TBD |
| Saturday 2025-07-26 05:00Z - Upgrade | TBD |
| Sunday 2025-07-27 06:00Z - Cutover/Switchover | TBD |
| Tuesday 2025-07-29 08:00Z - PCL finish | TBD |
Change Details
- Services Impacted - ServicePatroniCI, ServicePatroniRegistry, Database
- Change Technician - @rhenchen.gitlab, @bprescott_
- Change Reviewer - @rhenchen.gitlab, @alexander-sosna
- Time tracking - 4 days
- Downtime Component - Near Zero Customer Downtime (Database migrations need to be blocked)
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-25 23:00
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
Detailed steps for the change
We will assume that GitLab.com could be unavailable during the execution of the CR, so we will use ops.gitlab.net for the detailed instructions. We will use the following issues:
- ServicePatroniRegistry: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/82
- ServicePatroniCI: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/83
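The step-by-step instructions live in those issues. As one illustrative example (an assumption, not quoted from those runbooks), a pre-cutover check on the PG16 publisher could confirm that the logical replication link has caught up before the switchover:

```sql
-- Hypothetical pre-cutover check on the PG16 publisher: replay lag for the
-- logical walsender feeding the PG17 cluster should be close to zero bytes.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```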
Rollback Scenarios in case of Incident
The Upgrade CR can be aborted at any time in case of incidents, but we only need to abort the CR if the fix will perform a DDL (database migration); otherwise there is no impact on the upgrade process.
Note: "a DDL (database migration)" means a fix that requires a schema change through a DDL statement, such as the CREATE, ALTER, DROP or REINDEX SQL commands.
For any incident, a DBRE needs to evaluate the fix to be pushed before deciding what to do with the Upgrade CR.
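As a hypothetical aid for that evaluation (not part of the official runbook), in-flight DDL on the cluster could be spotted with a query along these lines:

```sql
-- Hypothetical helper: list active sessions currently executing DDL, which
-- would conflict with the logical-replication-based upgrade.
SELECT pid,
       usename,
       query_start,
       left(query, 120) AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
  AND query ~* '^\s*(create|alter|drop|reindex)\s';
```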
TODO: We have a DBRE on-call rotation for the upgrade PCL window to assist with evaluating and aborting the CR if necessary during incidents.
There are three approaches we might take in case of an incident that requires us to abort the CR:
- If the incident happens before the switchover, we will simply abort the CR and the database will remain on PG16;
- If the incident happens after the switchover, we can either:
  - Bring forward the end of the rollback window but keep the database on PG17; or
  - Roll back the upgrade and return the database to PG16.
In any case, deployments will resume.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.