[GSTG][Sec Decomp] Truncate gitlab_sec tables in main (and vice-versa)
Production Change
Change Summary
Now that we have completed the rollout and verification that we won't need to rollback to main, we can safely truncate the old sec tables within patroni-main and the main tables within sec.
Completes Truncate migrated sec tables on gitlab_main (gitlab-org/gitlab#523925 - closed).
For this change request, we want to run the rake task, which was introduced in feat: Support gitlab_sec DB in truncate tables ... (gitlab-org/gitlab!189237 - merged)
gitlab:db:truncate_legacy_tables:main-
gitlab:db:truncate_legacy_tables:sec(newly added)
The following may be useful to diagnose if things are taking too long to truncate a specific table:
- Check if locks exist for table:
select pid from pg_locks l join pg_class t on l.relation = t.oid where t.relkind = 'r' and t.relname = '<TABLE_NAME>';
- Find queries for locks associated with a table:
select pid, state, usename, query, query_start from pg_stat_activity where pid in (select pid from pg_locks l join pg_class t on l.relation = t.oid and t.relkind = 'r' where t.relname = 'ci_builds_metadata');
- Cancel a pid:
SELECT pg_cancel_backend(<PID>);
- Terminate a pid:
SELECT pg_terminate_backend(<PID>);
- Find the longest running queries:
select pid, state, usename, query, query_start from pg_stat_activity order by query_start asc limit 10;
Change Details
- Services Impacted - ServicePostgres ServicePatroni ServicePatroniSec
- Change Technician - @jjsisson
- Change Reviewer - @bshah11 @alexander-sosna @rhenchen.gitlab @bprescott_ @zbraddock
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-26 17:00
- Time tracking - 8h
- Downtime Component - None
Detailed steps for the change
Preparing: Checking any locks on any tables that need to be truncated
- On the Main Database, check the locks on the Sec tables
SELECT pid, relname
FROM pg_locks l
JOIN pg_class t ON l.relation = t.oid
WHERE t.relkind = 'r' AND (
starts_with(t.relname, 'dast_') OR
starts_with(t.relname, 'dependency_list') OR
starts_with(t.relname, 'sbom_') OR
starts_with(t.relname, 'security_') OR
starts_with(t.relname, 'vulnerabilit') OR
t.relname IN ('group_security_exclusions', 'project_security_exclusions', 'project_security_statistics')
) AND NOT (t.relname IN ('dast_profiles_pipelines', 'security_trainings', 'security_training_providers', 'vulnerability_occurrence_pipelines'));
- On the Sec Database, check the locks on any non-Sec tables. Might contain Shared Tables:
SELECT pid, relname
FROM pg_locks l
JOIN pg_class t ON l.relation = t.oid
WHERE t.relkind = 'r' AND NOT (
starts_with(t.relname, 'dast_') OR
starts_with(t.relname, 'dependency_list') OR
starts_with(t.relname, 'sbom_') OR
starts_with(t.relname, 'security_') OR
starts_with(t.relname, 'vulnerabilit') OR
t.relname IN ('group_security_exclusions', 'project_security_exclusions', 'project_security_statistics')
) OR (t.relname IN ('dast_profiles_pipelines', 'security_trainings', 'security_training_providers', 'vulnerability_occurrence_pipelines'));
- If needed, this is the query to list all the locks
SELECT pid, relname from pg_locks l join pg_class t ON l.relation = t.oid where t.relkind = 'r';
Pre-execution steps
-
Make sure all tasks in Change Technician checklist are done -
For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #productionchannel, mention@sre-oncalland this issue and await their acknowledgement.)-
The SRE on-call provided approval with the eoc_approved label on the issue.
-
-
For C1, C2, or blocks deployments change issues, Release managers have been informed prior to change being rolled out. (In #productionchannel, mention@release-managersand this issue and await their acknowledgment.) -
There are currently no active incidents that are severity1 or severity2 -
If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Change steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
-
Running during low utilization hours due to large amounts of IO from truncate. -
Set label changein-progress /label ~change::in-progress -
Merge console-truncate MR -
Validate there are definitely no queries for on the gitlab_sectables in the Main database-
Sequential Tuple Reads should be 0 -
Index tuple reads should be 0
-
-
Validate there are definitely no queries for gitlab_maintables in the Sec database. The queries are grouped byrelnameotherwise the results list is big to show.-
Sequential Tuple Reads should be 0 -
Index Tuple Reads should be 0
-
-
Connect to the main console ssh ${USER}-rails@console-01-sv-gstg.c.gitlab-staging-1.internal -
Lock the writes on all the newly added tables: -
DRY_RUN=true ./bin/rake gitlab:db:lock_writes, to print the tables that will be locked. -
./bin/rake gitlab:db:lock_writesto lock the tables.
-
-
We can introduce some time here (X Minutes) to monitor and see that there are no errors from locked tables.
- Monitor Sentry for any noticeable new errors that are related to database.
- Monitor PostgreSQL logs. By searching for
gitlab_schema_prevent_write.
-
Generate temporary password for truncate_useron your local machineexport TEMPPW=$(pwgen 64 1)
-
Create truncate_useronpatroni-mainandpatroni-seccluster leaders (Double check current leaders)ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal-
sudo gitlab-psqlCREATE ROLE truncate_user LOGIN PASSWORD '${TEMPPW}'; GRANT CONNECT ON DATABASE gitlabhq_production TO truncate_user; GRANT USAGE ON SCHEMA public TO truncate_user; GRANT USAGE ON SCHEMA gitlab_partitions_dynamic TO truncate_user; COPY (SELECT 'GRANT TRUNCATE ON ' || n.nspname || '.' || c.relname || ' TO truncate_user;' FROM pg_class c, pg_namespace n WHERE n.oid = c.relnamespace and c.oid in (SELECT tgrelid FROM pg_trigger WHERE tgname LIKE 'gitlab_schema_write_trigger_for_%') ORDER BY n.nspname;) TO /tmp/truncate_permissions.sql - Verify /tmp/truncate_permissions.sql looks correct (sec tables to be truncated in main
- Update permissions for /tmp/truncate_permissions.sql
sudo chown gitlab-psql:gitlab-psql /tmp/truncate_permissions.sql - Apply
truncate_userpermissions on mainsudo gitlab-psql -f /tmp/truncate_permissions.sql -
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internalCREATE ROLE truncate_user LOGIN PASSWORD '${TEMPPW}'; GRANT CONNECT ON DATABASE gitlabhq_production TO truncate_user; GRANT USAGE ON SCHEMA public TO truncate_user; GRANT USAGE ON SCHEMA gitlab_partitions_dynamic TO truncate_user; COPY (SELECT 'GRANT TRUNCATE ON ' || n.nspname || '.' || c.relname || ' TO truncate_user;' FROM pg_class c, pg_namespace n WHERE n.oid = c.relnamespace and c.oid in (SELECT tgrelid FROM pg_trigger WHERE tgname LIKE 'gitlab_schema_write_trigger_for_%') ORDER BY n.nspname;) TO /tmp/truncate_permissions.sql - Verify /tmp/truncate_permissions.sql looks correct (sec tables to be truncated in main
- Update permissions for /tmp/truncate_permissions.sql
sudo chown gitlab-psql:gitlab-psql /tmp/truncate_permissions.sql - Apply
truncate_userpermissions on secsudo gitlab-psql -f /tmp/truncate_permissions.sql
-
Connect to console-truncate host OS once it is available ssh ${USER}@console-truncate-01-sv-gstg.c.gitlab-staging-1.internal
-
Disable chef on console-truncate sudo chef-client-disable 'CR #19700'
-
Update database.yml main and sec to contain user truncate_userand the generated temporary passwordvi /var/opt/gitlab/gitlab-rails/etc/database.yml
-
Connect to console-truncate rails session ssh ${USER}_rails@console-truncate-01-sv-gstg.c.gitlab-staging-1.internal
-
Verify the correct user in being used -
ActiveRecord::Base.connection_db_config.configuration_hash[:username]# should returntruncate_user
-
-
Note disk usage from df:Used ?for patroni-main -
Truncating the (1/2) batches of the gitlab_sectables on the Main database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true UNTIL_TABLE=sbom_sources ./bin/rake gitlab:db:truncate_legacy_tables:mainUNTIL_TABLE=sbom_sources ./bin/rake gitlab:db:truncate_legacy_tables:main
-
Note disk usage from dfon the Main Patroni database server (should have gone down):Used ?=> reduced by ? -
Truncating the (2/2) batches of the gitlab_sectables on the Main database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true ./bin/rake gitlab:db:truncate_legacy_tables:main./bin/rake gitlab:db:truncate_legacy_tables:main
-
Note disk usage from dfon the Main Patroni database server (should have gone down):Used ?=> reduced by ? -
Close the Main primary database session. -
Connect to the Sec Patroni database server -
Note disk usage from df:Used ?for patroni-sec -
Truncating the (1/4) batches of the gitlab_maintables on the Sec database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true UNTIL_TABLE=container_expiration_policies ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]UNTIL_TABLE=container_expiration_policies ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
-
Note disk usage from dfon the Sec Patroni database server (should have gone down):Used ?=> reduced by ? -
Truncating the (2/4) batches of the gitlab_maintables on the Sec database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true UNTIL_TABLE=index_statuses ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]UNTIL_TABLE=index_statuses ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
-
Note disk usage from dfon the Sec Patroni database server (should have gone down):Used ?=> reduced by ? -
Truncating the (3/4) batches of the gitlab_maintables on the Sec database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true UNTIL_TABLE=programming_languages ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]UNTIL_TABLE=programming_languages ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
-
Note disk usage from dfon the Sec Patroni database server (should have gone down):Used ?=> reduced by ? -
Truncating the (4/4) batches of the gitlab_maintables on the Sec database. First command to see which tables are truncated in DRY_MODE:DRY_RUN=true ./bin/rake gitlab:db:truncate_legacy_tables:sec[10]./bin/rake gitlab:db:truncate_legacy_tables:sec[10]
-
Note disk usage from dfon the Sec Patroni database server (should have gone down):Used ?=> reduced by ? -
Close the Sec primary database session -
Revert console-truncate MR -
Drop truncate_useron patroni-main and patroni-secssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal 'sudo gitlab-psql -c "DROP ROLE IF EXISTS truncate_user;"'ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal 'sudo gitlab-psql -c "DROP ROLE IF EXISTS truncate_user;"' -
Set label changecomplete /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
-
GSTG - Restore cluster via snapshots. -
Set label changeaborted /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: SLAs
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Sustained Apdex drop (especially for rails, patroni, patroni-sec, or sidekiq)
- Metric: Patroni Service Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Sustained Service Error Ratio
- Metric: Patroni-sec Service Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Sustained Service Error Ratio
Change Reviewer checklist
-
Check if the following applies: - The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
-
Check if the following applies: - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
-
The change plan is technically accurate. -
This Change Issue is linked to the appropriate Issue and/or Epic -
Change has been tested in staging and results noted in a comment on this issue. -
A dry-run has been conducted and results noted in a comment on this issue. -
The change execution window respects the Production Change Lock periods. -
For C1 and C2 change issues, the change event is added to the GitLab Production calendar. -
For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagersin this issue to request approval and provide visibility to all infrastructure managers. -
For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #productionchannel, mention@release-managersand this issue and await their acknowledgment.)