[PRDSUB] Migrate to CloudSQL
Production Change
Change Summary
A change to migrate production CustomersDot from a self-hosted database to a CloudSQL one. Main issue: https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/152
Change Details
- Services Impacted - ServiceCustomersDot
-
Change Technician -
@ahmadsherif - Change Reviewer - @ebaque
- Time tracking - 60 minutes
- Downtime Component - YES
Detailed steps for the change
Pre-Change Steps - steps to take before executing the change
-
Merge and apply https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4424 -
Get CLoudSQL generated password and private IP: -
tf console -
nonsensitive(module.customers-db.generated_user_password) -
module.customers-db.private_address
-
-
Update postgresql_db_hostvalue in https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/merge_requests/329with the real private IP of the CloudSQL instance -
Communicate the maintenance window via Twitter/Status.io
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 15 minutes
-
Set label changein-progress /label ~change::in-progress -
Set label ~"production::blocker" in gitlab-org/customers-gitlab-com#5133 (closed) /label ~"production::blocker" -
Set label ~"provisioning blocker" in https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 /label ~"provisioning blocker" -
Create a silence for blackbox-related alerts. -
Log into customers-01-inf-prdsubssh customers-01-inf-prdsub.c.gitlab-subscriptions-prod.internal
-
Start a tmuxsessiontmux
-
Set up .pgassecho "*:*:*:customersdot_postgresql_user:$(sudo grep 'password: ' /home/customersdot/CustomersDot/current/config/database.yml | cut -d\" -f2)" >> ~/.pgpassecho '10.188.0.2:*:*:default:<value-from-pre-steps>' >> ~/.pgpasschmod 0600 ~/.pgpass
-
Set needed environment variables export CLSQL_IP="10.188.0.2"
-
Stop Puma and Sidekiq sudo systemctl stop puma && sudo systemctl stop sidekiq
-
Check if no DB connections are lingering sudo -u postgres psql\xSELECT * FROM pg_stat_activity;- Terminate lingering connections, if any
SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'CustomersDot_production' AND pid <> pg_backend_pid();
exit
-
List all tables with their row count, for comparison after the migration psql -h localhost -U customersdot_postgresql_user CustomersDot_production -c "select table_schema, table_name, (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count from (select table_name, table_schema, query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count from information_schema.tables where table_schema = 'public') t order by 2;" > /tmp/before-migration- Adapted from
https://stackoverflow.com/a/38684225/1856239
-
Dump the database pg_dump -h localhost -U customersdot_postgresql_user --format=directory --no-owner --no-acl CustomersDot_production --file=db-dump -j8
-
Import the roles sudo -u postgres pg_dumpall -r | sed 's/NOSUPERUSER//g' | psql -b -h $CLSQL_IP -U default postgres
-
Shutdown the database sudo systemctl stop system-postgresql.slice
-
Create the database psql -b -h $CLSQL_IP -U customersdot_postgresql_user -c 'CREATE DATABASE "CustomersDot_production" WITH TEMPLATE = template0;' postgres
-
Enable pg_trgmextensionpsql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pg_trgm WITH SCHEMA public;" CustomersDot_production
-
Enable pg_stat_statementsextensionpsql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;" CustomersDot_production
-
Enable pgcryptoextensionpsql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pgcrypto WITH SCHEMA public;" CustomersDot_production
-
Restore the database pg_restore -h $CLSQL_IP -d CustomersDot_production -U customersdot_postgresql_user -j8 db-dump
-
List all tables with their row count after migration psql -h $CLSQL_IP -U customersdot_postgresql_user CustomersDot_production -c "select table_schema, table_name, (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count from (select table_name, table_schema, query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count from information_schema.tables where table_schema = 'public') t order by 2;" > /tmp/after-migration
-
Check if there is any difference between the two databases diff -u /tmp/{before,after}-migration- No diff is good, otherwise halt and re-assess
-
Remove blocking label from https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 -
Merge https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/merge_requests/329 and trigger a deploy - This will start Puma and Sidekiq again
-
Remove .pgpassrm ~/.pgpass
-
Expire the silence created earlier -
Set label changecomplete /label ~change::complete -
Close https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 and gitlab-org/customers-gitlab-com#5133 (closed)
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5 minutes
-
Revert https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/merge_requests/329 and trigger a deploy -
Start the database sudo systemctl start system-postgresql.slice
-
Run Puma and Sidekiq sudo systemctl start puma && sudo systemctl start sidekiq
-
Expire the silence created earlier -
Set label changeaborted /label ~change::aborted -
Close https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 and gitlab-org/customers-gitlab-com#5133 (closed)
Change Reviewer checklist
-
Check if the following applies: - The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
-
Check if the following applies: - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
-
Check if all items below are complete: - The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention
@sre-oncalland this issue and await their acknowledgement.) - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention
@release-managersand this issue and await their acknowledgment.) - There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Ahmad Sherif