Check if table truncate task must run

What does this MR do and why?

This MR adds a check that runs when `main` and `ci` are on different databases. It verifies that legacy data has been truncated; if not, it advises running the rake task for truncation.

After switching to a decomposed database, a migration could fail because data has not been cleaned up. A possible scenario is this:

  • The projects table on the CI database still has data
  • A DML migration is run for the projects table, fixing some duplicate values. Data migrations are only run on the main database.
  • A DDL migration is run: a unique index is added that relies on the previous DML migration. DDL migrations run on both databases. Since the DML migration did not run on CI, adding the index will fail on CI.

So we want to encourage self-managed customers to run the rake task `gitlab:db:truncate_legacy_tables:#{database_name}` before they upgrade their installation and run migrations.

GitLab provides a pre-upgrade rake task that self-managed customers are supposed to run before upgrading. In our documentation, we recommend running `sudo gitlab-rake gitlab:check` (see https://docs.gitlab.com/ee/update/plan_your_upgrade.html).

Database performance

For each to-be-truncated table, we check whether it contains data. As soon as we find one table with data, we return `true`.

So we have two issues to solve:

  1. The most efficient way of determining whether a table is empty
  2. The most efficient way of checking a large number of tables

Most efficient way of determining whether a table is empty

The worst-case scenario is that a large table in `main` (`projects`) has been copied to `ci`. So we want to verify that `projects` on `ci` is empty.

The most naive implementation is `SELECT COUNT(*) FROM projects`. This takes a long time because it does a parallel index scan.

Another approach is just selecting one random row: `SELECT * FROM projects LIMIT 1`. Much better: about 5 ms.

Third approach: we do not care about the actual data, so `SELECT EXISTS (SELECT * FROM projects)`. Even better: about 1 ms.
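As a sketch (the helper name is mine, not the MR's actual code), the three probes look like this side by side. Only the `EXISTS` form lets PostgreSQL stop as soon as it can prove a row exists, without returning any row data:

```ruby
# Illustrative sketch of the three emptiness probes discussed above.
# The helper name is hypothetical, not GitLab's actual implementation.
def emptiness_probes(table)
  {
    # 1. Naive: counts every row (parallel index/seq scan on large tables).
    count:  "SELECT COUNT(*) FROM #{table}",
    # 2. Better: stops after the first row, but still fetches its columns.
    limit:  "SELECT * FROM #{table} LIMIT 1",
    # 3. Best: the planner only proves existence and returns one boolean.
    exists: "SELECT EXISTS (SELECT * FROM #{table})"
  }
end

puts emptiness_probes('projects')[:exists]
# → SELECT EXISTS (SELECT * FROM projects)
```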

Most efficient way of checking a large number of tables

I considered just checking a few known tables, like `projects` or `ci_builds`. But that does not rule out a migration causing issues in another table.

PostgreSQL keeps track of a per-table row estimate, and I considered using that, but it can still be off: not a reliable source if we want to be sure.
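For context, that estimate lives in `pg_catalog.pg_class` and could be read with a probe like the one below (a sketch, not code from this MR). It is only refreshed by `VACUUM`, `ANALYZE`, and a few DDL commands, which is exactly why it can be stale:

```ruby
# Builds a query for PostgreSQL's planner row estimate for one table.
# reltuples is refreshed only by VACUUM, ANALYZE and some DDL commands,
# so it can lag behind reality; since PostgreSQL 14, a value of -1 means
# the table has never been vacuumed or analyzed.
def row_estimate_sql(table)
  "SELECT reltuples::bigint FROM pg_class WHERE relname = '#{table}'"
end

puts row_estimate_sql('projects')
```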

So I opted for a big UNION query:

```sql
(SELECT EXISTS (SELECT * FROM abuse_events))
UNION
... other tables ...
(SELECT EXISTS (SELECT * FROM zoom_meetings))
```

Even when these 600 `main` tables in the `ci` database do have data, this query finishes in about 2 seconds. I tested this against postgres.ai, but unfortunately the query does not end up in the history.
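A sketch of how such a query could be assembled from the list of legacy tables (the helper name and table list are illustrative, not the MR's actual code):

```ruby
# Combines one EXISTS probe per table into a single round trip.
# Because UNION de-duplicates, the result set holds at most two boolean
# rows; the caller only has to check whether a `true` row is present.
def any_table_has_data_sql(tables)
  tables
    .map { |t| "(SELECT EXISTS (SELECT * FROM #{t}))" }
    .join("\nUNION\n")
end

puts any_table_has_data_sql(%w[abuse_events zoom_meetings])
```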

Screenshots or screen recordings

image

How to set up and validate locally

The `shards` table belongs to the `main` schema, so it should be empty on `ci`. So we add data to the `shards` table:

  1. Unlock writes: `bundle exec rake gitlab:db:unlock_writes`
  2. Insert data into the `ci` database: `gdk psql -d gitlabhq_development_ci -c "insert into shards (name) values('test');"`
  3. Run the check script: `bundle exec rake gitlab:check`. It should output `Tables are truncated? ... no`

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #401679 (closed)

Edited by Rutger Wessels
