As decomposed is still in Alpha, and our focus is on two connections, I have moved this to %16.0 and unassigned myself.
My concern here is that we will need support from ~group::database to verify / harden the shared models concept, and that there are at least 17 models to work through.
Thong Kuah changed title from Investigate if shared models can be migrated safely to Investigate if shared models can be migrated safely for decomposed
Thong Kuah changed title from Investigate if shared models can be migrated safely for decomposed to Ensure shared models can be migrated safely for decomposed
Thong Kuah changed the description
All batched_background_migrations related tables are safe to be copied from main to ci, because when picking up a migration for execution we filter by batched_background_migrations.gitlab_schema, which was introduced for the needs of decomposing GitLab.com.
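As a rough illustration (the exact scope in the codebase may be shaped differently), the pick-up query looks something like:

# Illustrative only: the migration runner for a given database considers just
# the rows whose gitlab_schema is served by that database's connection.
Gitlab::Database::BackgroundMigration::BatchedMigration
  .where(gitlab_schema: %w[gitlab_main gitlab_shared]) # schemas served by the main connection
  .order(:id)
  .first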
we filter by batched_background_migrations.gitlab_schema
Thanks @krasio - That's good to know about batched_background_migrations, batched_background_migration_jobs and batched_background_migration_job_transition_logs
What about background_migration_jobs? Will we have race conditions there?
What about background_migration_jobs? Will we have race conditions there?
This table is used for tracking old-style Sidekiq-based background migrations, which we do not use much nowadays. I think it should be fine too, as enqueuing such migrations requires DML (SELECT) and the Rails migration doing it will have to restrict_gitlab_migration. With a single DB both workers (one for main and one for ci) will write to the same table, and after the decomposition (and copying the data) each one will use the table from the respective DB.
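For reference, a minimal sketch of such a migration (the class name here is made up):

# Hypothetical example: a migration that enqueues old-style background
# migration jobs declares the schema it performs DML against.
class QueueExampleBackgroundMigration < Gitlab::Database::Migration[2.1]
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    # SELECT-based batching over the source table and job enqueueing go here.
  end

  def down
    # no-op
  end
end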
I assume self-managed decomposition will need downtime, right? We can also advise that both types of migrations be disabled upfront using the feature flags, as we did for GitLab.com.
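As a sketch, from a Rails console (the flag names are assumptions here, not verified):

# Illustrative only: disable scheduling of both migration types ahead of the
# decomposition; the actual feature flag names may differ.
Feature.disable(:execute_background_migrations)          # old-style Sidekiq-based
Feature.disable(:execute_batched_migrations_on_schedule) # batched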
These are views that I think we only SELECT from, so there is no need to even copy any data; they should be created when the ci database schema is created/copied.
Thanks @krasio! So these views are built from DB tables that won't be copied across (as we only pg_dump gitlabhq_production and restore to gitlabhq_production_ci)?
If so, I think I can check this off now.
db/structure.sql:CREATE VIEW postgres_autovacuum_activity AS
db/structure.sql-    FROM postgres_pg_stat_activity_autovacuum() postgres_pg_stat_activity_autovacuum(query, query_start)
db/structure.sql-  FROM processes;
db/structure.sql-COMMENT ON VIEW postgres_autovacuum_activity IS 'Contains information about PostgreSQL backends currently performing autovacuum operations on the tables indicated here.';
db/structure.sql:CREATE VIEW postgres_constraints AS
db/structure.sql-    FROM (unnest(pg_constraint.conkey) WITH ORDINALITY attnums(attnum, ordering)
db/structure.sql-  FROM ((pg_constraint
db/structure.sql:CREATE VIEW postgres_foreign_keys AS
db/structure.sql-  FROM ((((((pg_constraint
db/structure.sql-    FROM (unnest(pg_constraint.conkey) WITH ORDINALITY conkey(attnum, idx)
db/structure.sql:CREATE VIEW postgres_index_bloat_estimates AS
db/structure.sql-    FROM ( SELECT COALESCE(((1)::double precision + ceil((rows_hdr_pdg_stats.reltuples / floor((((((rows_hdr_pdg_stats.bs - (rows_hdr_pdg_stats.pageopqdata)::numeric) - (rows_hdr_pdg_stats.pagehdr)::numeric) * (rows_hdr_pdg_stats.fillfactor)::numeric))::double precision / ((100)::double precision * (((4)::numeric + rows_hdr_pdg_stats.nulldatahdrwidth))::double precision)))))), (0)::double precision) AS est_pages_ff,
db/structure.sql-    FROM ( SELECT rows_data_stats.maxalign,
db/structure.sql:CREATE VIEW postgres_indexes AS
db/structure.sql-  FROM ((((pg_index
db/structure.sql:CREATE VIEW postgres_partitioned_tables AS
db/structure.sql-  FROM (((( SELECT pg_partitioned_table.partrelid,
db/structure.sql-  FROM pg_partitioned_table) partitioned_tables
db/structure.sql:CREATE VIEW postgres_partitions AS
db/structure.sql-  FROM ((((pg_class
These two I am not very familiar with, but I think we had them before the decomposition of GitLab.com, so I assume it should be safe to have the data duplicated.
It's not 100% safe to have duplicated data in the loose_foreign_keys_deleted_records.
Consider the following use-case:
1. A pipeline is deleted (a pipeline has many builds). A new record is inserted into loose_foreign_keys_deleted_records.
2. Before the builds are cleaned up, we start the decomposition migration.
3. We end up with duplicated data (ci, main).
4. The LooseForeignKeys::CleanupWorker cleans up the pipeline and build records only in the CI cluster (since the tables belong to the CI schema).
5. Data will be left in the loose_foreign_keys_deleted_records table of the main DB.
6. Sliding partition switching might not work anymore because nothing will clean up the old CI rows in the main DB. (The partition switching logic cleans up partitions when there is nothing left to be processed.)
To address this problem, we need to manually clean up the loose FK queue before we run the decomposition migration. Something like this:
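A minimal sketch of such a pre-decomposition drain, assuming pending rows carry status = 1 as discussed later in this thread:

# Sketch only: process the loose FK queue on the single database until no
# pending (status = 1) records remain, before starting the decomposition.
connection = ApplicationRecord.connection

Gitlab::Database::SharedModel.using_connection(connection) do
  while LooseForeignKeys::DeletedRecord.where(status: 1).exists?
    LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
  end
end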
@ahegyi I have been testing this locally by deleting records from users and ci_pipelines on a single database setup, and then copying the database and having a decomposed setup. Here is what I found:
The LooseForeignKeys::CleanupWorker cleans up the pipeline and build records only in the CI cluster (since the tables belong to the CI schema).
Yes, CleanupWorker cleans the records from the relevant database. If it was running on main, it would clean up or nullify all the dependent records from gitlab_main tables, and the same on the ci database.
Data will be left in the loose_foreign_keys_deleted_records table of the main DB.
What I found is that all the records in loose_foreign_keys_deleted_records in both databases are marked as processed (`status = 2`), regardless of whether the parent table belongs to that database or not.
To clarify, when the worker runs on main:

- It deletes/nullifies rows in all the gitlab_main tables that depend on users
- It deletes/nullifies rows in all the gitlab_main tables that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2
To clarify, when the worker runs on ci:

- It deletes/nullifies rows in all the gitlab_ci tables that depend on users
- It deletes/nullifies rows in all the gitlab_ci tables that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2
I will need to test again to make sure. Maybe I can write some test cases for it as well.
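A quick way to re-check the per-status counts on each database (a sketch; it assumes status = 1 is pending and status = 2 is processed, as above):

# Sketch: count deleted records per status on each connection.
[ApplicationRecord.connection, Ci::ApplicationRecord.connection].each do |conn|
  Gitlab::Database::SharedModel.using_connection(conn) do
    pp LooseForeignKeys::DeletedRecord.group(:status).count
  end
end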
What I found is that all the records in loose_foreign_keys_deleted_records in both databases are marked as processed (`status = 2`), regardless of whether the parent table belongs to that database or not.
AFAIK this shouldn't work like that. Before switching to the decomposed version, check if you have some records with status=1. Once the decomposed system boots up, we should see that the worker won't scan CI LFK definitions when processing main. For example, a deleted record with table name ci_builds should stay unprocessed in the deleted records table.
before switching to the decomposed version, check if you have some records with status=1
Yes I did. Before I started the experiment, I cleaned up the table on the single database.
And before I ran the cleanup job, I made sure that I had N items in loose_foreign_keys_deleted_records on both databases, all of them with status=1.
connection = ApplicationRecord.connection; nil

Gitlab::Database::SharedModel.using_connection(connection) do
  LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
end

## The previous code was updating only gitlab_main tables

connection = Ci::ApplicationRecord.connection; nil

Gitlab::Database::SharedModel.using_connection(connection) do
  LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
end
This means that on PRD we're wasting some cycles checking for deleted records for tables which are never going to have records. I guess this can be a problem if we do the decomposition and install triggers: after that step the table cannot get any deletes.
The fix would be to select only the tables that belong to the schema of the current connection.
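A rough sketch of that filtering (the helper names are assumptions, not a final implementation):

# Hypothetical sketch: only process parent tables whose gitlab_schema is
# served by the connection currently being cleaned up.
schemas = Gitlab::Database.gitlab_schemas_for_connection(connection)

tables_to_process = Gitlab::Database::LooseForeignKeys.definitions_by_table.keys.select do |table|
  schemas.include?(Gitlab::Database::GitlabSchema.table_schema(table))
end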
@ahegyi after testing it further, I found out that my earlier list didn't cover everything that happened. Here is the full list of what happens:
When the worker runs on main or on ci, the exact same work happens, as long as the items in loose_foreign_keys_deleted_records are the same:

- It deletes/nullifies rows in all the gitlab_main tables on main that depend on users
- It deletes/nullifies rows in all the gitlab_main tables on main that depend on ci_pipelines
- It deletes/nullifies rows in all the gitlab_ci tables on ci that depend on users
- It deletes/nullifies rows in all the gitlab_ci tables on ci that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2 on the database it is running against
Basically, when we decompose the database and start the workers again, we duplicate the amount of work. That is totally fine, considering that there are probably not many items, if there are any at all, because we will most probably tell customers to stop their web services and drain their Sidekiq workers.
My only concern was that we might run DELETE or UPDATE for a table on the wrong database and end up with a lock-writes error, because when the database is decomposed we lock writes on the tables that are no longer relevant. But we always choose the connection from the model, regardless of which database the SharedModel is connected to. So everything is perfect.
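In other words (a simplified illustration):

# The DELETE/UPDATE goes through the model class, so it uses the model's own
# database regardless of the connection SharedModel is currently pinned to.
Gitlab::Database::SharedModel.using_connection(ApplicationRecord.connection) do
  Ci::Build.connection_db_config.name # => "ci", not "main"
end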
For batched background migrations I think we should document that all migrations must be completed before doing the migration to two databases. If that's the case, the migration tables will only contain data related to completed migrations, which might narrow the scope of our investigation.
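Something like this could serve as a pre-flight check (the numeric value for the finished status is an assumption here; verify it against the model first):

# Sketch: expect no batched background migrations in a non-finished state
# before decomposing. 3 is assumed to be the "finished" status.
Gitlab::Database::BackgroundMigration::BatchedMigration.where.not(status: 3).none?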
Adam Hegyi marked the checklist item db/docs/loose_foreign_keys_deleted_records.yml:gitlab_schema: gitlab_shared as completed
Adam Hegyi marked the checklist item db/docs/loose_foreign_keys_deleted_records.yml:gitlab_schema: gitlab_shared as incomplete