[GPRD] [2024-11-01 to 11-05] - Upgrade PostgreSQL to v16 on MAIN cluster
Change Summary
The Patroni MAIN cluster currently runs Postgres 14, an outdated engine version. During the maintenance we will upgrade a new standby cluster to Postgres 16 and update the Consul service registration to point GitLab.com applications to the new Patroni cluster running PG 16 (a verification sketch follows the key benefits below).
Key Benefits:
- An up-to-date PostgreSQL engine provides increased security and stability.
- Performance improvements in Postgres 16 will significantly improve our ability to operate the primary database at our current scale.
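For illustration only, a minimal sketch of how one might verify which engine version the Consul-registered writer endpoint is serving after the registration switch; the Consul service name, database name, and user below are placeholders, not the production values.

```python
# Minimal verification sketch, not part of the change automation. The Consul service
# name, database, and user below are placeholders / assumptions.
import psycopg2
import requests

CONSUL = "http://127.0.0.1:8500"          # local Consul agent (assumption)
SERVICE = "patroni-main-primary"          # hypothetical service name for the MAIN writer

def resolve_primary():
    """Return (address, port) of the first healthy instance registered in Consul."""
    resp = requests.get(f"{CONSUL}/v1/health/service/{SERVICE}", params={"passing": "true"})
    resp.raise_for_status()
    svc = resp.json()[0]["Service"]
    return svc["Address"], svc["Port"]

def primary_version(host, port):
    """Ask the registered primary which engine version it is actually running."""
    conn = psycopg2.connect(host=host, port=port,
                            dbname="gitlabhq_production",      # placeholder
                            user="gitlab-monitoring")          # placeholder
    with conn, conn.cursor() as cur:
        cur.execute("SELECT current_setting('server_version')")
        return cur.fetchone()[0]

if __name__ == "__main__":
    host, port = resolve_primary()
    print(f"{SERVICE} -> {host}:{port} running PostgreSQL {primary_version(host, port)}")
```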
Upgrade events schedule
The CR is executed on the following schedule:
Friday 01/11 - Release Managers activity:
- 02:00 UTC - Run the last PDM (likely APAC-hours RMs)
- 04:00 - 14:00 UTC - Run as many of the remaining deployments as possible
Friday 01/11 23:00 UTC - PCL start:
- The upgrade PCL starts together with the standard weekend PCL
- Enable the "block DDL database migrations" feature flag
Saturday 02/11 06:00 UTC - Upgrade:
- Check that there are no running migrations (the upgrade might have to be delayed if any are still in progress)
- Perform the database upgrade in the target cluster (set up logical replication from the old v14 cluster to the new v16 cluster)
- Start amcheck in the v16 database (checks a sample of tables and indexes for consistency), which might take 12+ hours to run; the cutover might have to be delayed until it completes (see the sketch after this list)
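A rough sketch of what the amcheck step amounts to, assuming the amcheck extension can be created and using a placeholder DSN and sample size; the real run is driven by our upgrade tooling and covers far more data.

```python
# Rough sketch of the amcheck step, assuming the amcheck extension can be created and
# using a placeholder DSN and sample size; the real run is driven by the upgrade tooling.
import psycopg2

DSN = "host=patroni-v16-standby dbname=gitlabhq_production"   # placeholder
SAMPLE_SIZE = 100                                             # assumption

SAMPLE_INDEXES = """
    SELECT c.oid::regclass
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
    WHERE c.relkind = 'i' AND am.amname = 'btree'
    ORDER BY random()
    LIMIT %s
"""

def check_sample():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck")
        cur.execute(SAMPLE_INDEXES, (SAMPLE_SIZE,))
        for (index,) in cur.fetchall():
            # heapallindexed=true also verifies that every heap tuple has an index entry,
            # which is slower but a stronger consistency check.
            cur.execute("SELECT bt_index_check(%s::regclass, true)", (index,))
            print(f"OK  {index}")

if __name__ == "__main__":
    check_sample()
```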
Sunday 03/11 06:00 UTC - Cutover/Switchover:
- Workload switchover to v16 (v16 will become the active database)
  - The switchover should be unnoticeable for end-user customers, but there is a small risk of downtime if the automation fails
- Enable reverse replication (new v16 -> old v14; see the sketch after this list)
- Start the rollback window (monitor the workload for performance regressions on the new engine version)
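Conceptually, reverse replication is plain PostgreSQL logical replication pointing the other way. The sketch below is illustrative only; host names and the publication/subscription names are assumptions, and the actual step is performed by the upgrade automation.

```python
# Conceptual sketch of the reverse-replication step; host names, publication and
# subscription names are illustrative assumptions, the actual step is performed by
# the upgrade automation.
import psycopg2

V16_PRIMARY_DSN = "host=patroni-v16-primary dbname=gitlabhq_production"  # placeholder
V14_PRIMARY_DSN = "host=patroni-v14-primary dbname=gitlabhq_production"  # placeholder

def enable_reverse_replication():
    # 1) Publish all tables from the now-active v16 primary.
    v16 = psycopg2.connect(V16_PRIMARY_DSN)
    v16.autocommit = True
    with v16.cursor() as cur:
        cur.execute("CREATE PUBLICATION rollback_pub FOR ALL TABLES")
    v16.close()

    # 2) Subscribe the old v14 primary so it keeps receiving changes and stays a
    #    viable rollback target during the rollback window. copy_data=false because
    #    both sides are identical at switchover time.
    v14 = psycopg2.connect(V14_PRIMARY_DSN)
    v14.autocommit = True            # CREATE SUBSCRIPTION cannot run inside a transaction
    with v14.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION rollback_sub "
            "CONNECTION 'host=patroni-v16-primary dbname=gitlabhq_production' "
            "PUBLICATION rollback_pub "
            "WITH (copy_data = false)"
        )
    v14.close()

if __name__ == "__main__":
    enable_reverse_replication()
```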
Tuesday 05/11 11:00 UTC - PCL finish:
- End the rollback window
- End the operational lock
- Deployments will resume
Wednesday 06/11 09:00 UTC - Run the first PDM:
- To have enough packages available in case of problems
- Normal deployment cadence resumes
Downtime Requirements:
The PG 16 upgrade is an online process with near-zero downtime for end users. The workload will be enqueued during the switchover, which should take a few seconds, but there is a risk of up to 5 minutes of downtime if manual intervention is required for the Writer endpoint switchover.
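As an illustration of what "enqueued for a few seconds" looks like from the client side, a hypothetical probe could poll the writer endpoint and report when it becomes writable on v16; the endpoint DSN below is a placeholder, not the production connection string.

```python
# Hypothetical client-side probe (not the CR automation): poll the writer endpoint during
# the switchover to see how long writes are paused and when v16 becomes active. The
# endpoint DSN below is a placeholder.
import time
import psycopg2

WRITER_DSN = "host=master.patroni.service.consul dbname=gitlabhq_production connect_timeout=2"  # placeholder

def probe_writer(interval=1.0):
    """Print the writer state once per interval; stop with Ctrl-C."""
    while True:
        try:
            with psycopg2.connect(WRITER_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery(), current_setting('server_version')")
                in_recovery, version = cur.fetchone()
                state = "read-only" if in_recovery else "writable"
                print(f"{time.strftime('%H:%M:%S')} {state} on PostgreSQL {version}")
        except psycopg2.OperationalError as exc:
            # During the cutover the endpoint may briefly refuse connections.
            print(f"{time.strftime('%H:%M:%S')} unreachable ({type(exc).__name__})")
        time.sleep(interval)

if __name__ == "__main__":
    probe_writer()
```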
Requirement of Hard PCL for the CR Rollback window
The database upgrade process uses PostgreSQL logical replication to provide near-zero downtime. Unfortunately, logical replication is incompatible with database migrations (model changes), therefore we require an operational lock (PCL) to block deployments to the production MAIN database during the whole period of this change.
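The block itself is enforced by the feature flag and the PCL; purely as a hedged illustration of what "DDL during the window" means at the database level, one could scan pg_stat_activity on the primary for schema-changing statements. The DSN below is a placeholder.

```python
# The block itself is enforced by the feature flag and the PCL; this is only a hedged
# illustration of how DDL in flight could be spotted on the primary. The DSN is a placeholder.
import psycopg2

DSN = "host=patroni-main-primary dbname=gitlabhq_production"   # placeholder
DDL_PREFIXES = ("CREATE ", "ALTER ", "DROP ", "REINDEX ")

def active_ddl():
    """Return active backends whose current statement looks like DDL."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pid, usename, left(query, 120)
            FROM pg_stat_activity
            WHERE state = 'active' AND pid <> pg_backend_pid()
        """)
        return [row for row in cur.fetchall()
                if row[2] and row[2].lstrip().upper().startswith(DDL_PREFIXES)]

if __name__ == "__main__":
    for pid, user, query in active_ddl():
        print(f"DDL in flight: pid={pid} user={user} query={query!r}")
```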
In the context of the database upgrade CR, a rollback window is paramount, as it allows us to quickly revert the CR if the new database engine version causes any performance regression. From a database reliability engineering perspective, the upgrade can also be seen as one very large atomic deployment that touches every Ruby ActiveRecord CRUD operation in GitLab's code, because every SQL execution plan in the database might be positively or negatively affected by the new engine version.
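One hedged way to watch for such plan regressions during the rollback window, assuming pg_stat_statements is installed and was reset at switchover so the numbers reflect only the v16 workload, is to list the slowest statements on v16 and compare them against a v14 baseline captured before the cutover. The DSN is a placeholder.

```python
# Illustrative regression check, assuming pg_stat_statements is installed and was reset at
# switchover so the numbers reflect only the v16 workload; the DSN is a placeholder. The
# output would be compared against a v14 baseline captured before the cutover.
import psycopg2

DSN = "host=patroni-v16-primary dbname=gitlabhq_production"   # placeholder

TOP_STATEMENTS = """
    SELECT queryid, calls, round(mean_exec_time::numeric, 2) AS mean_ms, left(query, 100)
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 20
"""

def top_statements():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(TOP_STATEMENTS)
        return cur.fetchall()

if __name__ == "__main__":
    for queryid, calls, mean_ms, query in top_statements():
        print(f"queryid={queryid} calls={calls} mean={mean_ms}ms {query}")
```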
Compared with last year, we are increasing the PCL window to allow a 48 hour rollback window; note that the CR PCL overlaps with the standard weekend PCL window.
- Last year: 48 hour block pre-upgrade + standard weekend PCL (no rollback window)
- This year (proposed): standard weekend PCL (6 hour block pre-upgrade within the weekend PCL) + 33 hours rollback window PCL
Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
|  | @rhenchen.gitlab |
|  | @daveyleach |
|  | @rmar1 |
|  | @rhenchen.gitlab |
|  | @daveyleach |
|  | Self-Serve with escalations via PD |
|  | @nduff and @donnaalexandra (Switchover period), or ping @sre-oncall |
|  | @samihiltunen (Switchover period), or ping @imoc |
|  | Check the CMOC escalation table |
|  | TBD |
|  | @rmar1 @alexives @nhxnguyen |
|  | @mbursi @dawsmith |
|  | @marin |
📣 CMOC Escalation Table
Important: this table is only for when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To |
|---|---|
| Friday 01/11 23:00 UTC - PCL start | @tristan |
| Saturday 02/11 06:00 UTC - Upgrade | @rotanak |
| Sunday 03/11 06:00 UTC - Cutover/Switchover | @rotanak |
| Tuesday 05/11 11:00 UTC - PCL finish | @tmarsh1 |
Change Details
- Services Impacted - Service::Patroni, Service::Postgres Database
- Change Technician - @rhenchen.gitlab @daveyleach
- Change Reviewer - @vitabaks @bshah11 @alexander-sosna
- Time tracking - 3 days
- Downtime Component - Near Zero Customer Downtime (Database migrations need to be blocked)
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
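The runbook is the authoritative procedure; as a sketch only, maintenance mode can be toggled through the GitLab application settings API. The base URL and token below are placeholders.

```python
# Sketch only; the runbook is the authoritative procedure. GitLab maintenance mode can be
# toggled through the application settings API; the base URL and token are placeholders.
import requests

GITLAB_API = "https://gitlab.example.com/api/v4"   # placeholder
HEADERS = {"PRIVATE-TOKEN": "glpat-REDACTED"}      # admin token, placeholder

def set_maintenance_mode(enabled, message=None):
    payload = {"maintenance_mode": "true" if enabled else "false"}
    if message:
        payload["maintenance_mode_message"] = message
    resp = requests.put(f"{GITLAB_API}/application/settings", headers=HEADERS, data=payload)
    resp.raise_for_status()
    return resp.json().get("maintenance_mode")

if __name__ == "__main__":
    set_maintenance_mode(True, "Scheduled database maintenance in progress")
    # ... perform the change ...
    set_maintenance_mode(False)
```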
Detailed steps for the change
We assume GitLab.com could be unavailable during the execution of the CR, so the detailed instructions are kept on ops.gitlab.net in the following issue: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/72
Rollback Scenarios in case of Incident
The Upgrade CR can be aborted at any time in case of incidents, but we only need to abort the CR if the fix requires a DDL (database migration); otherwise there is no impact on the upgrade process.
Note: "a DDL (database migration)" means a fix that requires a schema change through a DDL statement, such as the CREATE, ALTER, DROP, or REINDEX SQL commands.
For any incident, a DBRE needs to evaluate the fix to be pushed before deciding what to do with the Upgrade CR.
We have a DBRE on-call rotation for the upgrade PCL window to assist with evaluating and aborting the CR if necessary during incidents.
There are three approaches that might be taken in case of an incident that requires us to abort the CR:
- If the incident is before the Switchover, we'll just abort the CR and the database will remain on v14;
- If the incident is after the Switchover, we can either:
  - Bring forward the end of the rollback window but keep the database on v16;
  - Or roll back the upgrade and return the database to v14 (see the pre-rollback check sketch after this list).
In any case, deployments will resume.
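Before choosing the full rollback to v14, we would confirm that the reverse (v16 -> v14) replication has caught up. A hypothetical check follows; the slot name and DSN are assumptions matching the illustrative names used earlier in this issue.

```python
# Hypothetical pre-rollback check: confirm the reverse (v16 -> v14) logical replication has
# caught up before switching back. Slot name and DSN are assumptions matching the
# illustrative names used earlier in this issue.
import psycopg2

V16_PRIMARY_DSN = "host=patroni-v16-primary dbname=gitlabhq_production"   # placeholder

LAG_QUERY = """
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS pending
    FROM pg_replication_slots
    WHERE slot_name = %s AND slot_type = 'logical'
"""

def reverse_replication_lag(slot_name="rollback_sub"):
    with psycopg2.connect(V16_PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY, (slot_name,))
        row = cur.fetchone()
        if row is None:
            raise RuntimeError(f"logical slot {slot_name!r} not found on the v16 primary")
        return row

if __name__ == "__main__":
    slot, pending = reverse_replication_lag()
    print(f"{slot}: {pending} of WAL not yet confirmed by the v14 subscriber")
```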
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.