Explaining the errors that we have got during truncating the legacy tables

Background

As part of #366787 (closed), we wanted to truncate all the legacy tables (gitlab_main on ci db, and, gitlab_ci on main db)

How the rake task works

This is the rake task that we ran gitlab:db:truncate_legacy_tables:main and gitlab:db:truncate_legacy_tables:ci https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/db/truncate_legacy_tables.rake

This task gets the list of the tables sorted by how they are dependent on each other using foreign keys. For simplicity, the tables can be [t1, t2, t3, ... t10]. Where t1 can be referencing all other tables, where t10 is referenced by any of the previous tables.

The rake task divides the list of the tables into batches to truncate, to avoid high IO problems while truncating. But the problem is users cannot truncate tables t10 without including t1 as well to truncate into the truncate statement again. Even if t1 is empty already.

That's why we had to run the truncate statements in an order that looked like this

TRUNCATE TABLE t1, t2
TRUNCATE TABLE t1, t2, t3, t4
TRUNCATE TABLE t1, t2, t3, t4, t5, t6

This is also explained in the description of the MR that introduced this rake tasks

Errors while truncating the errors.

While truncating the legacy tables on Staging and Production. We have faced some errors related to partitions. So I decided to open this issue to investigate what's going on, to have an explanation, and easy steps to reproduce. Furthermore, because we have an ongoing work to add tooling for tables partitioning, it's good to share these findings here.

These are the change requests for truncating the legacy tables:

Staging: gitlab-com/gl-infra/production#7585 (closed)
Production: gitlab-com/gl-infra/production#7770 (closed)

Staging Error

While running the rake truncate task on gitlab_ci tables on the main database on staging, we had this error

rake aborted!
ActiveRecord::StatementInvalid: PG::SREModifyingSqlDataNotPermitted: ERROR:  Table: "security_findings_1" is write protected within this Gitlab database.
HINT:  Make sure you are using the right database connection
CONTEXT:  PL/pgSQL function gitlab_schema_prevent_write() line 4 at RAISE

Here is the explanation for it:

With the rake task gitlab:db:lock_writes we added triggers to prevent writes on the legacy tables on the day we did the database decomposition. We add those triggers ONLY on the partitioned tables, but not on the partitions. In order to avoid running these triggers when we truncate the tables, we disable the triggers temporarily using some config designed for this.

security_findings used to be a non-partitioned tables. But later it was converted into a partition security_findings_1 which is the first partition of this partitioned table. Here is the MR: !93558 (merged). But when this happened, the trigger was kept on that partition table security_findings_1

This caused the error. But in order to get around it, we disabled the trigger on the on the security_findings table temporarily, and then we re-enabled them again after the truncation.

gitlabhq_production=# ALTER TABLE security_findings DISABLE TRIGGER gitlab_schema_write_trigger_for_security_findings;
ALTER TABLE

We also added this step later as a preparation to truncate the security_findings table on production as well.

This behaviour can be reproduced by this test table

CREATE TABLE test(value int not null);
CREATE TRIGGER test_trigger BEFORE TRUNCATE on test FOR EACH STATEMENT EXECUTE FUNCTION gitlab_schema_prevent_write();
ALTER TABLE test RENAME to test1;
CREATE TABLE test (LIKE test1 INCLUDING ALL) PARTITION BY RANGE(value);
ALTER TABLE test ATTACH PARTITION test1 FOR VALUES FROM (1) TO (10);
CREATE TABLE test2 PARTITION OF test FOR VALUES FROM (11) TO (20);

If you run \d+ test1 you can see the trigger still on the partition, but it's not listed in \d+ test2

What I propose

This partition security_findings_1 will be dropped later at some point, so there is no need to worry about removing this trigger for now. But in the future, we need to be careful about which triggers we are copying to the partitions that were tables before. We have an EPIC for this tooling: &8476 (closed)

Production Error

On Production we faced another error

I, [2022-10-01T07:12:53.883387 #90805]  INFO -- : TRUNCATE TABLE abuse_reports, agent_activity_events, agent_group_authorizations, agent_project_authorizations, alert_management_alert_assignees, alert_management_alert_metric_images, alert_management_alert_user_mentions, alert_management_alerts, alert_management_http_integrations, incident_management_pending_alert_escalations RESTRICT
rake aborted!
ActiveRecord::StatementInvalid: PG::FeatureNotSupported: ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "incident_management_pending_alert_escalations_202207" references "alert_management_alerts".
HINT:  Truncate table "incident_management_pending_alert_escalations_202207" at the same time, or use TRUNCATE ... CASCADE.

What happened here is related to how to manage the partitions of the tables. When we detach partitions, that are not needed anymore, we "upgrade them" to tables that will be truncated later. The problematic partitions here belong to the MonthlyPartition strategy. We detach the tables and then drop them some time later. You can see the partitions that are scheduled for deletion in the table detached_partitions, along with the time to drop them.

The problem is that these tables have foreign keys, and these tables are not included in the TRUNCATE statement anymore, because we don't do CASCADE to make sure we don't truncate any needed tables. Our truncation works on the partitions but these tables are not partitions anymore. But they are still scheduled for deletion later.

Having foreign keys on these tables, caused this error. And the only way to get around it was to drop all the foreign keys on these partitions.

What I propose

These partitions are still pending deletion. So there is no action needed. But in any future truncation change requests, this has to be take into consideration. For example, we can make an empty copy of the Staging or Production database (only schema), and truncate it locally first to make sure that we don't have a problem before running the change request.

The Database::DropDetachedPartitionsWorker will still work to delete all the left-over partitions from both databases. It even deleted the table web_hook_logs_202205 from both of them in September. This can be seen on Thanos here. Look for patroni and patroni-ci

Edited Oct 18, 2022 by Omar Qunsul