stgsub and stgsub-ref: Upgrade CloudSQL Instance Customers-DB from Postgres 12 to Postgres 16

Production Change

Upgrade CloudSQL Instance for Customers-DB from Postgres 12 to Postgres 16 in gitlab-subscriptions-staging and gitlab-subscriptions-stg-ref

This is a precursor to Upgrade CloudSQL Instance for Customers-DB from Postgres 12 to Postgres 16 in gitlab-subscriptions-prod on UTC 2025-06-08 05:00:00

Services Impacted - Service::CustomersDot
Change Technician -@zbraddock @vitallium
Change Reviewer - @jjsisson @rhenchen.gitlab
Scheduled Date and Time - UTC 2025-05-29 05:00:00
Time tracking - 60 minutes
Downtime Component - 20 minutes of downtime for stgsub-ref and stgsub. However, we have already confirmed with the Fulfilment team on multiple occasions that we can have downtime in stgsub-ref and stgsub without negative effects. If we need to rollback, this will add another 25min of downtime.

The reason we want to take downtime is because our zero-downtime process will not work for CloudSQL, because CloudSQL does not support pgbouncers.

Not relevant for staging and stg-ref

Estimated Time to Complete (mins) - 60 minutes

Set label changein-progress /label ~change::in-progress
Create ~production::blocker and ~provision::blocker issues in CustomersDot and Ansible projects to block deployments.
Create a new CloudSQL Instance at version Postgres 12 for each environment so it is available for rollback if needed. Takes approx 15min to spinup an empty CloudSql instance

Merge https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10931. Atlantis will timeout, but that is ok because the CloudSQL instances should still be created. This MR will create the rollback instance for production as well - that is ok, I'd rather have this created well in advance of 8th June anyway.

gitlab-subscriptions-stg-ref

gitlab-subscriptions-staging

Cleanup

Destroy the rollback CloudSQL Instance at version Postgres 12 for each environment, as we can no longer rollback.

Set label changecomplete /label ~change::complete
Close ~production::blocker and ~provision::blocker issues in CustomersDot and Ansible projects to unblock deployments.

Estimated Time to Complete (mins) - 25min

Once we have run Run customers:maintenance_mode[off] Rake task to disable the maintenance mode we can no longer rollback that environment. This was extensively discussed and the risk agreed to here: gitlab-com/gl-infra/data-access/dbo/dbo-issue-tracker#343 (comment 2431510003)

Metric: Puma
- Location: https://dashboards.gitlab.net/d/customersdot-main/customersdot-overview?orgId=1
- What changes to this metric should prompt a rollback: Puma reports an unrecoverable error or a Ruby error.
Metric: Sidekiq
- Location: https://dashboards.gitlab.net/d/customersdot-main/customersdot-overview?orgId=1
- What changes to this metric should prompt a rollback: Sidekiq reports an unrecoverable error or a Ruby error.
Metric: Health check
- Location: https://customersdot.cloudwatch.net/dashboard
- What changes to this metric should prompt a rollback: If both Puma and Sidekiq are running but we still receive health error.

Edited May 30, 2025 by Vitaly Slobodin