[GSTG][Sec Decomp] Truncate gitlab_sec tables in main (and vice-versa)

Production Change

Change Summary

Now that the rollout is complete and we have verified that we will not need to roll back to main, we can safely truncate the old sec tables within patroni-main and the main tables within sec.

Completes Truncate migrated sec tables on gitlab_main (gitlab-org/gitlab#523925 - closed).

For this change request, we want to run the rake tasks introduced in feat: Support gitlab_sec DB in truncate tables ... (gitlab-org/gitlab!189237 - merged):

  1. gitlab:db:truncate_legacy_tables:main
  2. gitlab:db:truncate_legacy_tables:sec (newly added)

The following queries may help diagnose why a specific table is taking too long to truncate:

  1. Check whether locks exist for a table:
    1. SELECT pid FROM pg_locks l JOIN pg_class t ON l.relation = t.oid WHERE t.relkind = 'r' AND t.relname = '<TABLE_NAME>';
  2. Find the queries holding locks on a table:
    1. SELECT pid, state, usename, query, query_start FROM pg_stat_activity WHERE pid IN (SELECT pid FROM pg_locks l JOIN pg_class t ON l.relation = t.oid AND t.relkind = 'r' WHERE t.relname = '<TABLE_NAME>');
  3. Cancel a backend by pid:
    1. SELECT pg_cancel_backend(<PID>);
  4. Terminate a backend by pid:
    1. SELECT pg_terminate_backend(<PID>);
  5. Find the longest-running queries:
    1. SELECT pid, state, usename, query, query_start FROM pg_stat_activity ORDER BY query_start ASC LIMIT 10;
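The first diagnostic can be wrapped in a small shell helper so the SQL can be piped straight into psql. This is a convenience sketch, not part of the runbook tooling: `lock_query` is a name invented here, and piping into `sudo gitlab-psql` assumes you are on the database host.

```shell
# Hypothetical helper: print the lock-check SQL for a given table so it can
# be run as, e.g.:  lock_query ci_builds_metadata | sudo gitlab-psql
lock_query() {
  printf "select pid from pg_locks l join pg_class t on l.relation = t.oid where t.relkind = 'r' and t.relname = '%s';\n" "$1"
}

lock_query ci_builds_metadata
```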

Change Details

  1. Services Impacted - Service::Postgres Service::Patroni Service::PatroniSec
  2. Change Technician - @jjsisson
  3. Change Reviewer - @bshah11 @alexander-sosna @rhenchen.gitlab @bprescott_ @zbraddock
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-26 17:00
  5. Time tracking - 8h
  6. Downtime Component - None

Detailed steps for the change

Preparation: Check for locks on any tables that need to be truncated

  1. On the Main Database, check the locks on the Sec tables
SELECT pid, relname
FROM pg_locks l
JOIN pg_class t ON l.relation = t.oid
WHERE t.relkind = 'r' AND (
  starts_with(t.relname, 'dast_') OR
  starts_with(t.relname, 'dependency_list') OR
  starts_with(t.relname, 'sbom_') OR
  starts_with(t.relname, 'security_') OR
  starts_with(t.relname, 'vulnerabilit') OR        
  t.relname IN ('group_security_exclusions', 'project_security_exclusions', 'project_security_statistics')
) AND NOT (t.relname IN ('dast_profiles_pipelines', 'security_trainings', 'security_training_providers', 'vulnerability_occurrence_pipelines'));
  2. On the Sec Database, check the locks on any non-Sec tables (may include shared tables):
SELECT pid, relname
FROM pg_locks l
JOIN pg_class t ON l.relation = t.oid
WHERE t.relkind = 'r' AND (NOT (
  starts_with(t.relname, 'dast_') OR
  starts_with(t.relname, 'dependency_list') OR
  starts_with(t.relname, 'sbom_') OR
  starts_with(t.relname, 'security_') OR
  starts_with(t.relname, 'vulnerabilit') OR
  t.relname IN ('group_security_exclusions', 'project_security_exclusions', 'project_security_statistics')
) OR t.relname IN ('dast_profiles_pipelines', 'security_trainings', 'security_training_providers', 'vulnerability_occurrence_pipelines'));
  3. If needed, this query lists all current locks:
SELECT pid, relname FROM pg_locks l JOIN pg_class t ON l.relation = t.oid WHERE t.relkind = 'r';
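The prefix rules shared by the two queries above can also be expressed as a small shell predicate, which is handy for quickly checking which side a given relname belongs to. This is a sketch: `table_schema` is a hypothetical helper, and it reports the four shared tables as "main", mirroring how the first query excludes them via NOT IN.

```shell
# Classify a relname with the same prefix rules used by the lock queries.
# The four shared tables excluded via NOT IN are reported as "main" here.
table_schema() {
  case "$1" in
    dast_profiles_pipelines|security_trainings|security_training_providers|vulnerability_occurrence_pipelines) echo main ;;
    dast_*|dependency_list*|sbom_*|security_*|vulnerabilit*) echo sec ;;
    group_security_exclusions|project_security_exclusions|project_security_statistics) echo sec ;;
    *) echo main ;;
  esac
}

table_schema sbom_sources        # prints "sec"
table_schema security_trainings  # prints "main" (shared table)
```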

Pre-execution steps

  • Make sure all tasks in Change Technician checklist are done
  • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • The SRE on-call provided approval with the eoc_approved label on the issue.
  • For C1, C2, or blocks deployments change issues, Release managers have been informed prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents that are severity::1 or severity::2
  • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Run during low-utilization hours, since truncation generates a large amount of IO.

  • Set the change::in-progress label: /label ~change::in-progress

  • Merge console-truncate MR

  • Validate there are definitely no queries running against the gitlab_sec tables in the Main database

  • Validate there are definitely no queries running against the gitlab_main tables in the Sec database. Group the results by relname; otherwise the result list is too large to display.

  • Connect to the main console

    ssh ${USER}-rails@console-01-sv-gstg.c.gitlab-staging-1.internal
  • Lock the writes on all the newly added tables:

    1. DRY_RUN=true ./bin/rake gitlab:db:lock_writes to print the tables that will be locked.
    2. ./bin/rake gitlab:db:lock_writes to lock the tables.
  • We can pause here (X minutes) to monitor and confirm that there are no errors from the locked tables.

    1. Monitor Sentry for any noticeable new errors related to the database.
    2. Monitor PostgreSQL logs by searching for gitlab_schema_prevent_write.
  • Generate temporary password for truncate_user on your local machine

    • export TEMPPW=$(pwgen 64 1)
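If pwgen is not installed on the workstation, any CSPRNG-backed generator of the same length works. The following is an alternative, not the documented step; it assumes openssl is available locally:

```shell
# Fallback: 32 random bytes hex-encoded gives a 64-character password,
# matching the length of `pwgen 64 1`.
TEMPPW=$(openssl rand -hex 32)
echo "${#TEMPPW}"   # 64
```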
  • Create truncate_user on patroni-main and patroni-sec cluster leaders (Double check current leaders)

    • ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal
    • sudo gitlab-psql
      CREATE ROLE truncate_user LOGIN PASSWORD '${TEMPPW}'; -- paste the generated password; psql will not expand the shell variable
      GRANT CONNECT ON DATABASE gitlabhq_production TO truncate_user;
      GRANT USAGE ON SCHEMA public TO truncate_user;
      GRANT USAGE ON SCHEMA gitlab_partitions_dynamic TO truncate_user;
      COPY (SELECT 'GRANT TRUNCATE ON ' || n.nspname || '.' || c.relname || ' TO truncate_user;' FROM pg_class c, pg_namespace n WHERE n.oid = c.relnamespace AND c.oid IN (SELECT tgrelid FROM pg_trigger WHERE tgname LIKE 'gitlab_schema_write_trigger_for_%') ORDER BY n.nspname) TO '/tmp/truncate_permissions.sql';
    • Verify /tmp/truncate_permissions.sql looks correct (sec tables to be truncated in main)
    • Update permissions for /tmp/truncate_permissions.sql
      sudo chown gitlab-psql:gitlab-psql /tmp/truncate_permissions.sql
    • Apply truncate_user permissions on main
      sudo gitlab-psql -f /tmp/truncate_permissions.sql
    • ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
    • sudo gitlab-psql
      CREATE ROLE truncate_user LOGIN PASSWORD '${TEMPPW}'; -- paste the generated password; psql will not expand the shell variable
      GRANT CONNECT ON DATABASE gitlabhq_production TO truncate_user;
      GRANT USAGE ON SCHEMA public TO truncate_user;
      GRANT USAGE ON SCHEMA gitlab_partitions_dynamic TO truncate_user;
      COPY (SELECT 'GRANT TRUNCATE ON ' || n.nspname || '.' || c.relname || ' TO truncate_user;' FROM pg_class c, pg_namespace n WHERE n.oid = c.relnamespace AND c.oid IN (SELECT tgrelid FROM pg_trigger WHERE tgname LIKE 'gitlab_schema_write_trigger_for_%') ORDER BY n.nspname) TO '/tmp/truncate_permissions.sql';
    • Verify /tmp/truncate_permissions.sql looks correct (main tables to be truncated in sec)
    • Update permissions for /tmp/truncate_permissions.sql
      sudo chown gitlab-psql:gitlab-psql /tmp/truncate_permissions.sql
    • Apply truncate_user permissions on sec
      sudo gitlab-psql -f /tmp/truncate_permissions.sql
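For reference, the COPY statement above writes one GRANT per trigger-protected table into /tmp/truncate_permissions.sql. The shape of that output can be sketched as follows; `gen_grants` is a hypothetical helper with illustrative table names, while the real list comes from pg_trigger:

```shell
# Emit GRANT TRUNCATE statements of the same shape the COPY query produces
# (illustrative table names only; the real set is trigger-derived).
gen_grants() {
  for t in "$@"; do
    printf 'GRANT TRUNCATE ON public.%s TO truncate_user;\n' "$t"
  done
}

gen_grants sbom_sources security_scans
```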
  • Connect to console-truncate host OS once it is available

    • ssh ${USER}@console-truncate-01-sv-gstg.c.gitlab-staging-1.internal
  • Disable chef on console-truncate

    • sudo chef-client-disable 'CR #19700'
  • Update the main and sec sections of database.yml to use user truncate_user and the generated temporary password

    • vi /var/opt/gitlab/gitlab-rails/etc/database.yml
  • Connect to console-truncate rails session

    • ssh ${USER}_rails@console-truncate-01-sv-gstg.c.gitlab-staging-1.internal
  • Verify the correct user is being used

    • ActiveRecord::Base.connection_db_config.configuration_hash[:username] # should return truncate_user
  • Note disk usage from df: Used ? for patroni-main

  • Truncate batch (1/2) of the gitlab_sec tables on the Main database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true UNTIL_TABLE=sbom_sources ./bin/rake gitlab:db:truncate_legacy_tables:main
    2. UNTIL_TABLE=sbom_sources ./bin/rake gitlab:db:truncate_legacy_tables:main
  • Note disk usage from df on the Main Patroni database server (should have gone down): Used ? => reduced by ?

  • Truncate batch (2/2) of the gitlab_sec tables on the Main database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true ./bin/rake gitlab:db:truncate_legacy_tables:main
    2. ./bin/rake gitlab:db:truncate_legacy_tables:main
  • Note disk usage from df on the Main Patroni database server (should have gone down): Used ? => reduced by ?

  • Close the Main primary database session.

  • Connect to the Sec Patroni database server

  • Note disk usage from df: Used ? for patroni-sec

  • Truncate batch (1/4) of the gitlab_main tables on the Sec database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true UNTIL_TABLE=container_expiration_policies ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
    2. UNTIL_TABLE=container_expiration_policies ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
  • Note disk usage from df on the Sec Patroni database server (should have gone down): Used ? => reduced by ?

  • Truncate batch (2/4) of the gitlab_main tables on the Sec database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true UNTIL_TABLE=index_statuses ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
    2. UNTIL_TABLE=index_statuses ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
  • Note disk usage from df on the Sec Patroni database server (should have gone down): Used ? => reduced by ?

  • Truncate batch (3/4) of the gitlab_main tables on the Sec database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true UNTIL_TABLE=programming_languages ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
    2. UNTIL_TABLE=programming_languages ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
  • Note disk usage from df on the Sec Patroni database server (should have gone down): Used ? => reduced by ?

  • Truncate batch (4/4) of the gitlab_main tables on the Sec database. Run the first command (DRY_RUN) to preview which tables will be truncated:

    1. DRY_RUN=true ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
    2. ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
  • Note disk usage from df on the Sec Patroni database server (should have gone down): Used ? => reduced by ?
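Each batch above follows the same dry-run-then-execute pattern. A small helper can print both command lines consistently for a given task and stopping point; `truncate_cmd` is a name invented here for illustration, not part of the runbook tooling:

```shell
# Print the dry-run and real invocations for one truncate batch, mirroring
# the numbered command pairs in the steps above.
truncate_cmd() {  # usage: truncate_cmd <rake task> [UNTIL_TABLE value]
  task="$1"
  cmd="./bin/rake $task"
  if [ -n "${2:-}" ]; then cmd="UNTIL_TABLE=$2 $cmd"; fi
  printf 'DRY_RUN=true %s\n%s\n' "$cmd" "$cmd"
}

truncate_cmd 'gitlab:db:truncate_legacy_tables:sec[10]' index_statuses
```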

  • Close the Sec primary database session

  • Revert console-truncate MR

  • Drop truncate_user on patroni-main and patroni-sec:
    ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal 'sudo gitlab-psql -c "DROP ROLE IF EXISTS truncate_user;"'
    ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal 'sudo gitlab-psql -c "DROP ROLE IF EXISTS truncate_user;"'

  • Set the change::complete label: /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • GSTG - Restore cluster via snapshots.
  • Set the change::aborted label: /label ~change::aborted

Monitoring

Key metrics to observe

  • Metric: SLAs
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Sustained Apdex drop (especially for rails, patroni, patroni-sec, or sidekiq)
  • Metric: Patroni Service Error Ratio
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Sustained Service Error Ratio
  • Metric: Patroni-sec Service Error Ratio
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Sustained Service Error Ratio

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • The change plan is technically accurate.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • The change execution window respects the Production Change Lock periods.
  • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
  • For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
Edited by Jonathon Sisson