Skip to content

Avoiding sub-transactions in automatic locking writes - DB migrations

What does this MR do and why?

Addressing: #385446 (closed). Original issue was reported here on gprd #385397 (closed)

The workflow of the new functionality

graph TD
    A[Lock Writes Rake Task] --> |with_retries = true| B[Lock Writes Manager]
    C{create_table DB Migration} --> |transactional? <br> with_retries = false| B[Lock Writes Manager]
    C{create_table DB Migration} --> |not transactional? <br> with_retries = true| B[Lock Writes Manager]
    B[Lock Writes Manager] --> |with_retries = true| D[WithLockRetries]
    D --> E[Execute Lock Writes on Table]
    B --> |with_retries = false| E

Information about Decomposed Database and Locking Writes on Legacy Tables (Skip if you are familiar with this)

As part of the database decomposition project we made a copy of the database on a new cluster. We used the first database cluster, called main for all the database tables that belong to the gitlab_main gitlab schema. And used the new database cluster ci for all the tables that belong to the gitlab_schema gitlab_ci. While the gitlab_shared tables are still used on both clusters

After the decomposition we use gitlab_main and gitlab_shared tables on the database cluster main. And we use the gitlab_ci and gitlab_shared tables on the database cluster ci. But the database schema (not to be confused with gitlab_schema) is still the same on both database clusters. We are keeping it now like this for simplicity. That means we still have some legacy tables, that we have been truncated in October 2022 on Production, that the application in theory still can write/read from, but it should not. These tables should be ignored. We decided to lock_writes to them, to detect with errors early on that the application is writing to the wrong database.

For example, the application should not write to the users table on the ci database, even though it exists. And the same applies to the ci_builds table on the main database. We should get an error instead if this happens. That's why we lock write on the tables.

Until now we have to rake task that would lock writes to these legacy tables (CI tables in the Main database and Main tables in the CI database). The rake task is gitlab:db:lock_writes. We ran this task in production after executing CI decomposition but it was a once off task. And we ran it again once as part of the change request gitlab-com/gl-infra/production#7770 (closed). Since then we’ve created new tables and they aren’t locked yet. As such there might be some gitlab_ci tables in main, and gitlab_main tables on ci, and which can still be written to but they should not be. The merged MR takes our original approach to locking and expands it to lock newly created tables by hooking into our DB migrations to detect newly created tables. But this functionality is still controlled by a feature flag automatic_lock_writes_on_table which was turned on for short time and caused a problem that is addressed by this MR.

What is the problem that is addressed by this MR?

When the FF has been enabled on production on 12th December, and a merge request that contains a new table database migration, it has caused a sub-transaction due to calling the lock_retries twice.

This problem was not detected or known when !99287 (merged) was developed, merged and the FF automatic_lock_writes_on_table was enabled. It's disabled now until we merge this MR, and have it deployed on staging and production.

How was the problem mitigated when it was reported?

The FF has been reverted to false on both environments staging and production to mitigate the issue, until this one is resolved. The FF is already set to false by default for any other environments.

With this feature flag set to false, all database migrations in version 2.1 will behave exactly as the previous version 2.0. That means newly created tables are not locked for writes, and the LockWritesManager is not even called.

How is this Merge Request solving the problem?

  1. Adding support to the Gitlab::Database::LockWritesManager to run in two different modes: A. Using with_retries helper that opens a transaction. B. Without with_retries, if the LockWritesManager is running as a part of a database transaction already. For example, if it was called from the DDL database migration that already is encapsulated within a database transaction.

  2. We call Gitlab::Database::LockWritesManager with with_retries set to false from the transactional database migrations (without disable_ddl_transactions!) that creates a new table. In any other place, we use with_retries set to true, for example when we call the rake task rake gitlab:db:lock_writes which locks writes on any legacy table, or from non-transactional database migrations. Tests have been included to cover both cases of database migrations.

What are we doing regarding the legacy tables that haven't been locked for writes, but should be locked?

There is a follow up issue to introduce a change request: #384852 (closed) As part of this change request, we will ask a DBRE to run the rake task rake gitlab:db:lock_writes manually to lock all the tables that haven't been locked.

How to test and validate this change locally?

Click to expand

Testing the old lock_tables rake task

Make sure you are running on a decomposed database

It should lock the CI tables on the main database, and raise an error if it tries to lock them.

rake gitlab:db:unlock_writes
rake gitlab:db:lock_writes
gdk psql -c 'delete from ci_builds'

You should get the error

ERROR:  Table: "ci_builds" is write protected within this Gitlab database.
HINT:  Make sure you are using the right database connection
CONTEXT:  PL/pgSQL function gitlab_schema_prevent_write() line 4 at RAISE

Clean up

rake gitlab:db:unlock_writes

Testing with a Transactional DB Migration (the default)

./bin/rails runner 'Feature.enable(:automatic_lock_writes_on_table)' bin/rails generate model TestLockWritesItem title:string body:text

Create this file 'db/docs/test_lock_writes_items.yml' with the following content. It's this file that will decide the gitlab_schema of the newly created table

---
table_name: test_lock_writes_items
classes:
- TestLockWritesItem
feature_categories:
- users
description: TODO
introduced_by_url: '-'
milestone: "<6.0"
gitlab_schema: gitlab_main

rake db:migrate:ci # it should work and lock the writes on the new table on the ci database

gdk psql -d 'gitlabhq_development_ci' -c 'delete from test_lock_writes_items'

You should get the error

ERROR:  Table: "ci_builds" is write protected within this Gitlab database.
HINT:  Make sure you are using the right database connection
CONTEXT:  PL/pgSQL function gitlab_schema_prevent_write() line 4 at RAISE

rake db:rollback:ci

Testing with a Transactional DB Migration (the default)

  • Change the created database migration by making it a non-transactional db/migrate/<DATETIME>_create_test_lock_writes_items.rb by commenting out the line disable_ddh_transaction!

  • Run rake db:migrate. You should see this in the output

I, [2022-12-21T19:07:47.369805 #48836]  INFO -- : Database: 'ci', Table: 'test_lock_writes_items': Lock Writes
I, [2022-12-21T19:07:47.369844 #48836]  INFO -- : Locking the writes on the table using with_retries

Clean up

rake db:rollback:ci
bin/rails d model TestLockWritesItem title:string body:text
rm db/docs/test_lock_writes_items.yml

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #385446 (closed)

Edited by Omar Qunsul

Merge request reports