[PRDSUB] Migrate to CloudSQL

Production Change

Change Summary

A change to migrate production CustomersDot from a self-hosted database to a CloudSQL one. Main issue: https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/152

Change Details

  1. Services Impacted - ServiceCustomersDot
  2. Change Technician - @ahmadsherif
  3. Change Reviewer - @ebaque
  4. Time tracking - 60 minutes
  5. Downtime Component - YES

Detailed steps for the change

Pre-Change Steps - steps to take before executing the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 15 minutes

  • Set label changein-progress /label ~change::in-progress
  • Set label ~"production::blocker" in gitlab-org/customers-gitlab-com#5133 (closed) /label ~"production::blocker"
  • Set label ~"provisioning blocker" in https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 /label ~"provisioning blocker"
  • Create a silence for blackbox-related alerts.
  • Log into customers-01-inf-prdsub
    • ssh customers-01-inf-prdsub.c.gitlab-subscriptions-prod.internal
  • Start a tmux session
    • tmux
  • Set up .pgass
    • echo "*:*:*:customersdot_postgresql_user:$(sudo grep 'password: ' /home/customersdot/CustomersDot/current/config/database.yml | cut -d\" -f2)" >> ~/.pgpass
    • echo '10.188.0.2:*:*:default:<value-from-pre-steps>' >> ~/.pgpass
    • chmod 0600 ~/.pgpass
  • Set needed environment variables
    • export CLSQL_IP="10.188.0.2"
  • Stop Puma and Sidekiq
    • sudo systemctl stop puma && sudo systemctl stop sidekiq
  • Check if no DB connections are lingering
    • sudo -u postgres psql
    • \x
    • SELECT * FROM pg_stat_activity;
    • Terminate lingering connections, if any
      • SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'CustomersDot_production' AND pid <> pg_backend_pid();
    • exit
  • List all tables with their row count, for comparison after the migration
    • psql -h localhost -U customersdot_postgresql_user CustomersDot_production -c "select table_schema, table_name, (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count from (select table_name, table_schema, query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count from information_schema.tables where table_schema = 'public') t order by 2;" > /tmp/before-migration
    • Adapted from https://stackoverflow.com/a/38684225/1856239
  • Dump the database
    • pg_dump -h localhost -U customersdot_postgresql_user --format=directory --no-owner --no-acl CustomersDot_production --file=db-dump -j8
  • Import the roles
    • sudo -u postgres pg_dumpall -r | sed 's/NOSUPERUSER//g' | psql -b -h $CLSQL_IP -U default postgres
  • Shutdown the database
    • sudo systemctl stop system-postgresql.slice
  • Create the database
    • psql -b -h $CLSQL_IP -U customersdot_postgresql_user -c 'CREATE DATABASE "CustomersDot_production" WITH TEMPLATE = template0;' postgres
  • Enable pg_trgm extension
    • psql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pg_trgm WITH SCHEMA public;" CustomersDot_production
  • Enable pg_stat_statements extension
    • psql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;" CustomersDot_production
  • Enable pgcrypto extension
    • psql -b -h $CLSQL_IP -U default -c "CREATE EXTENSION IF NOT EXISTS pgcrypto WITH SCHEMA public;" CustomersDot_production
  • Restore the database
    • pg_restore -h $CLSQL_IP -d CustomersDot_production -U customersdot_postgresql_user -j8 db-dump
  • List all tables with their row count after migration
    • psql -h $CLSQL_IP -U customersdot_postgresql_user CustomersDot_production -c "select table_schema, table_name, (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count from (select table_name, table_schema, query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count from information_schema.tables where table_schema = 'public') t order by 2;" > /tmp/after-migration
  • Check if there is any difference between the two databases
    • diff -u /tmp/{before,after}-migration
    • No diff is good, otherwise halt and re-assess
  • Remove blocking label from https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165
  • Merge https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/merge_requests/329 and trigger a deploy
    • This will start Puma and Sidekiq again
  • Remove .pgpass
    • rm ~/.pgpass
  • Expire the silence created earlier
  • Set label changecomplete /label ~change::complete
  • Close https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/165 and gitlab-org/customers-gitlab-com#5133 (closed)

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Ahmad Sherif