Handling locked partitions when truncating legacy tables (!108104) · Merge requests · GitLab.org / GitLab

What does this MR do and why?

Explaining the problem briefly

We have two rake tasks that can be used to truncate legacy tables. These two rake tasks were initially introduced in this MR

rake gitlab:db:truncate_legacy_tables:main
rake gitlab:db:truncate_legacy_tables:ci

When we ran the rake tasks on GSTG and GPRD to truncate the legacy tables (gitlab_ci tables on main database and gitlab_main tables on ci database) we found a problem that was first discovered on Staging.

Some attached partition security_findings_1 happened to be locked for writes as well, and couldn't be truncated. This is because this table partition used to be a table and then converted into a partition. But the lock-writes triggered were kept on the table on the ci database.

Why did this problem happen?

When we truncate legacy tables, we disable the lock-writes triggers temporarily (within the current transaction) on the partitioned tables, but not on the partitions themselves.

How does this MR solve the problem?

With this MR, we are also disabling the lock-writes triggers temporarily on the table partitions as well.

Why do we lock the writes on the legacy tables anyway?

As part of the database decomposition project we made a copy of the database on a new cluster. We used the first database cluster, called main for all the database tables that belong to the gitlab_main gitlab schema. And used the new database cluster ci for all the tables that belong to the gitlab_schema gitlab_ci. While the gitlab_shared tables are still used on both clusters

After the decomposition we use gitlab_main and gitlab_shared tables on the database cluster main. And we use the gitlab_ci and gitlab_shared tables on the database cluster ci. But the database schema (not to be confused with gitlab_schema) is still the same on both database clusters. We are keeping it now like this for simplicity. That means we still have some legacy tables, that we have been truncated recently, that the application in theory still can write/read from, but it should not. These tables should be ignored. We decided to lock_writes to them, to detect with errors early on that the application is writing to the wrong database.

For example, the application should not write to the users table on the ci database, even though it exists. And the same applies to the ci_builds table on the main database. We should get an error instead.

How to set up and validate locally

Prerequisites

Make sure you are running gitlab locally in decomposed mode: #354065 (closed)
Make sure you have the legacy tables locked for writes by running rake gitlab:db:lock_writes

Instructions

Create a gitlab_main table partition on the ci database by running:

$ gdk psql -d gitlabhq_development_ci -c 'CREATE TABLE gitlab_partitions_dynamic.security_findings_test_partition PARTITION OF security_findings FOR VALUES IN (50000)'

Lock the new partition for writes. We don't usually lock partitions, but we need to simulate a write-locked partition (see the MR description)

$ ./bin/rails runner Gitlab::Database::LockWritesManager.new(table_name: "gitlab_partitions_dynamic.security_findings_test_partition", connection: Ci::ApplicationRecord.connection, database_name: "ci", with_retries: false).lock_writes'

Truncate all the legacy main tables on the ci database. It might take a while, but it should not error.

$ rake 'gitlab:db:truncate_legacy_tables:ci[100]'

Clean up

$ rake gitlab:db:unlock_writes
$ gdk psql -d gitlabhq_development_ci -c 'drop table gitlab_partitions_dynamic.security_findings_test_partition'

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #386759 (closed)

Edited Jan 10, 2023 by Omar Qunsul

Handling locked partitions when truncating legacy tables