Explaining the errors that we have got during truncating the legacy tables
Background
As part of #366787 (closed), we wanted to truncate all the legacy tables (gitlab_main
on ci
db, and, gitlab_ci
on main
db)
How the rake task works
This is the rake task that we ran gitlab:db:truncate_legacy_tables:main
and gitlab:db:truncate_legacy_tables:ci
https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/db/truncate_legacy_tables.rake
This task gets the list of the tables sorted by how they are dependent on each other using foreign keys. For simplicity, the tables can be [t1, t2, t3, ... t10]
. Where t1
can be referencing all other tables, where t10
is referenced by any of the previous tables.
The rake task divides the list of the tables into batches to truncate, to avoid high IO problems while truncating. But the problem is users cannot truncate tables t10
without including t1
as well to truncate into the truncate statement again. Even if t1
is empty already.
That's why we had to run the truncate statements in an order that looked like this
TRUNCATE TABLE t1, t2
TRUNCATE TABLE t1, t2, t3, t4
TRUNCATE TABLE t1, t2, t3, t4, t5, t6
This is also explained in the description of the MR that introduced this rake tasks
Errors while truncating the errors.
While truncating the legacy tables on Staging and Production. We have faced some errors related to partitions. So I decided to open this issue to investigate what's going on, to have an explanation, and easy steps to reproduce. Furthermore, because we have an ongoing work to add tooling for tables partitioning, it's good to share these findings here.
These are the change requests for truncating the legacy tables:
- Staging: gitlab-com/gl-infra/production#7585 (closed)
- Production: gitlab-com/gl-infra/production#7770 (closed)
Staging Error
While running the rake truncate task on gitlab_ci
tables on the main
database on staging
, we had this error
rake aborted!
ActiveRecord::StatementInvalid: PG::SREModifyingSqlDataNotPermitted: ERROR: Table: "security_findings_1" is write protected within this Gitlab database.
HINT: Make sure you are using the right database connection
CONTEXT: PL/pgSQL function gitlab_schema_prevent_write() line 4 at RAISE
Here is the explanation for it:
With the rake task gitlab:db:lock_writes
we added triggers
to prevent writes on the legacy tables on the day we did the database decomposition. We add those triggers ONLY on the partitioned tables, but not on the partitions. In order to avoid running these triggers when we truncate the tables, we disable the triggers temporarily using some config designed for this.
security_findings
used to be a non-partitioned tables. But later it was converted into a partition security_findings_1
which is the first partition of this partitioned table. Here is the MR: !93558 (merged). But when this happened, the trigger was kept on that partition table security_findings_1
This caused the error. But in order to get around it, we disabled the trigger on the on the security_findings
table temporarily, and then we re-enabled them again after the truncation.
gitlabhq_production=# ALTER TABLE security_findings DISABLE TRIGGER gitlab_schema_write_trigger_for_security_findings;
ALTER TABLE
We also added this step later as a preparation to truncate the security_findings
table on production as well.
This behaviour can be reproduced by this test table
CREATE TABLE test(value int not null);
CREATE TRIGGER test_trigger BEFORE TRUNCATE on test FOR EACH STATEMENT EXECUTE FUNCTION gitlab_schema_prevent_write();
ALTER TABLE test RENAME to test1;
CREATE TABLE test (LIKE test1 INCLUDING ALL) PARTITION BY RANGE(value);
ALTER TABLE test ATTACH PARTITION test1 FOR VALUES FROM (1) TO (10);
CREATE TABLE test2 PARTITION OF test FOR VALUES FROM (11) TO (20);
If you run \d+ test1
you can see the trigger still on the partition, but it's not listed in \d+ test2
What I propose
This partition security_findings_1
will be dropped later at some point, so there is no need to worry about removing this trigger for now. But in the future, we need to be careful about which triggers we are copying to the partitions that were tables before. We have an EPIC for this tooling: &8476 (closed)
Production Error
On Production we faced another error
I, [2022-10-01T07:12:53.883387 #90805] INFO -- : TRUNCATE TABLE abuse_reports, agent_activity_events, agent_group_authorizations, agent_project_authorizations, alert_management_alert_assignees, alert_management_alert_metric_images, alert_management_alert_user_mentions, alert_management_alerts, alert_management_http_integrations, incident_management_pending_alert_escalations RESTRICT
rake aborted!
ActiveRecord::StatementInvalid: PG::FeatureNotSupported: ERROR: cannot truncate a table referenced in a foreign key constraint
DETAIL: Table "incident_management_pending_alert_escalations_202207" references "alert_management_alerts".
HINT: Truncate table "incident_management_pending_alert_escalations_202207" at the same time, or use TRUNCATE ... CASCADE.
What happened here is related to how to manage the partitions of the tables. When we detach partitions, that are not needed anymore, we "upgrade them" to tables that will be truncated later. The problematic partitions here belong to the MonthlyPartition
strategy. We detach the tables and then drop them some time later. You can see the partitions that are scheduled for deletion in the table detached_partitions
, along with the time to drop them.
The problem is that these tables have foreign keys, and these tables are not included in the TRUNCATE
statement anymore, because we don't do CASCADE
to make sure we don't truncate any needed tables. Our truncation works on the partitions but these tables are not partitions anymore. But they are still scheduled for deletion later.
Having foreign keys on these tables, caused this error. And the only way to get around it was to drop all the foreign keys on these partitions.
What I propose
These partitions are still pending deletion. So there is no action needed. But in any future truncation change requests, this has to be take into consideration. For example, we can make an empty copy of the Staging or Production database (only schema), and truncate it locally first to make sure that we don't have a problem before running the change request.
The Database::DropDetachedPartitionsWorker
will still work to delete all the left-over partitions from both databases. It even deleted the table web_hook_logs_202205
from both of them in September. This can be seen on Thanos here. Look for patroni
and patroni-ci