As decomposed is still in Alpha, and our focus is on two connections, I have moved this to %16.0 and unassigned myself.
My concern here is that we will need support from ~group::database to verify / harden the shared models concept, and that there are at least 17 models to work through.
Thong Kuah changed title from Investigate if shared models can be migrated safely to Investigate if shared models can be migrated safely for decomposed
Thong Kuah changed title from Investigate if shared models can be migrated safely for decomposed to Ensure shared models can be migrated safely for decomposed
Thong Kuah changed the description
All batched_background_migrations related tables are safe to be copied from main to ci, because when picking up a migration for execution we filter by batched_background_migrations.gitlab_schema, which was introduced for the needs of decomposing GitLab.com.
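As a rough illustration (the exact scope in the codebase may be shaped differently), the pick-up query looks something like:

# Illustrative only: the migration runner for a given database considers just
# the rows whose gitlab_schema is served by that database's connection.
Gitlab::Database::BackgroundMigration::BatchedMigration
  .where(gitlab_schema: %w[gitlab_main gitlab_shared]) # schemas served by the main connection
  .order(:id)
  .first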
we filter by batched_background_migrations.gitlab_schema
Thanks @krasio - That's good to know about batched_background_migrations, batched_background_migration_jobs and batched_background_migration_job_transition_logs
What about background_migration_jobs? Will we have race conditions there?
What about background_migration_jobs? Will we have race conditions there?
This table is used for tracking old-style Sidekiq-based background migrations, which we do not use much nowadays. I think it should be fine too, as enqueuing such migrations requires DML (SELECT) and the Rails migration doing it will have to restrict_gitlab_migration. With a single DB both workers (one for main and one for ci) will write to the same table, and after the decomposition (and copying the data) each one will use the table from the respective DB.
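For reference, a minimal sketch of such a migration (the class name here is made up):

# Hypothetical example: a migration that enqueues old-style background
# migration jobs declares the schema it performs DML against.
class QueueExampleBackgroundMigration < Gitlab::Database::Migration[2.1]
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    # SELECT-based batching over the source table and job enqueueing go here.
  end

  def down
    # no-op
  end
end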
I assume self-managed decomposition will need downtime, right? We can also advise that both types of migrations be disabled upfront using the feature flags, as we did for GitLab.com.
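As a sketch, from a Rails console (the flag names are assumptions here, not verified):

# Illustrative only: disable scheduling of both migration types ahead of the
# decomposition; the actual feature flag names may differ.
Feature.disable(:execute_background_migrations)          # old-style Sidekiq-based
Feature.disable(:execute_batched_migrations_on_schedule) # batched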
These are views that I think we only SELECT from, so there is no need to even copy any data; they should be created when the ci database schema is created/copied.
Thanks @krasio! So these views are built from DB tables that won't be copied across (as we only pg_dump gitlabhq_production and restore to gitlabhq_production_ci)?
If so, I think I can check this off now.
db/structure.sql:CREATE VIEW postgres_autovacuum_activity AS
db/structure.sql-    FROM postgres_pg_stat_activity_autovacuum() postgres_pg_stat_activity_autovacuum(query, query_start)
db/structure.sql-  FROM processes;
db/structure.sql-COMMENT ON VIEW postgres_autovacuum_activity IS 'Contains information about PostgreSQL backends currently performing autovacuum operations on the tables indicated here.';
db/structure.sql:CREATE VIEW postgres_constraints AS
db/structure.sql-    FROM (unnest(pg_constraint.conkey) WITH ORDINALITY attnums(attnum, ordering)
db/structure.sql-  FROM ((pg_constraint
db/structure.sql:CREATE VIEW postgres_foreign_keys AS
db/structure.sql-  FROM ((((((pg_constraint
db/structure.sql-    FROM (unnest(pg_constraint.conkey) WITH ORDINALITY conkey(attnum, idx)
db/structure.sql:CREATE VIEW postgres_index_bloat_estimates AS
db/structure.sql-    FROM ( SELECT COALESCE(((1)::double precision + ceil((rows_hdr_pdg_stats.reltuples / floor((((((rows_hdr_pdg_stats.bs - (rows_hdr_pdg_stats.pageopqdata)::numeric) - (rows_hdr_pdg_stats.pagehdr)::numeric) * (rows_hdr_pdg_stats.fillfactor)::numeric))::double precision / ((100)::double precision * (((4)::numeric + rows_hdr_pdg_stats.nulldatahdrwidth))::double precision)))))), (0)::double precision) AS est_pages_ff,
db/structure.sql-    FROM ( SELECT rows_data_stats.maxalign,
db/structure.sql:CREATE VIEW postgres_indexes AS
db/structure.sql-  FROM ((((pg_index
db/structure.sql:CREATE VIEW postgres_partitioned_tables AS
db/structure.sql-  FROM (((( SELECT pg_partitioned_table.partrelid,
db/structure.sql-  FROM pg_partitioned_table) partitioned_tables
db/structure.sql:CREATE VIEW postgres_partitions AS
db/structure.sql-  FROM ((((pg_class
These two I am not very familiar with, but I think we had them before the decomposition of GitLab.com, so I assume it should be safe to have the data duplicated.
It's not 100% safe to have duplicated data in the loose_foreign_keys_deleted_records.
Consider the following use-case:
1. A pipeline is deleted (a pipeline has many builds). A new record is inserted into loose_foreign_keys_deleted_records.
2. Before the builds are cleaned up, we start the decomposition migration.
3. We end up with duplicated data (ci, main).
4. The LooseForeignKeys::CleanupWorker cleans up the pipeline and build records only in the CI cluster (since the tables belong to the CI schema).
5. Data will be left in the loose_foreign_keys_deleted_records table of the main DB.
6. Sliding partition switching might not work anymore because nothing will clean up the old CI rows in the main DB. (The partition switching logic cleans up partitions when there is nothing left to be processed.)
To address this problem, we need to manually clean up the loose FK queue before we run the decomposition migration. Something like this:
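A minimal sketch of such a pre-decomposition drain, assuming pending rows carry status = 1 as discussed later in this thread:

# Sketch only: process the loose FK queue on the single database until no
# pending (status = 1) records remain, before starting the decomposition.
connection = ApplicationRecord.connection

Gitlab::Database::SharedModel.using_connection(connection) do
  while LooseForeignKeys::DeletedRecord.where(status: 1).exists?
    LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
  end
end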
@ahegyi I have been testing this locally by deleting records from users and ci_pipelines on a single database setup, and then copying the database and having a decomposed setup. Here is what I found:
The LooseForeignKeys::CleanupWorker cleans up the pipeline and build records only in the CI cluster (since the tables belong to the CI schema).
Yes, CleanupWorker cleans the records from the relevant database. If it was running on main, it would clean up or nullify all the dependent records from gitlab_main tables, and the same on the ci database.
Data will be left in the loose_foreign_keys_deleted_records table of the main DB.
What I found is that all the records in loose_foreign_keys_deleted_records in both databases are marked as processed (`status = 2`), regardless of whether the parent table belongs to that database or not.
To clarify, when the worker runs on main:

- It deletes/nullifies rows in all the gitlab_main tables that depend on users
- It deletes/nullifies rows in all the gitlab_main tables that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2
To clarify, when the worker runs on ci:

- It deletes/nullifies rows in all the gitlab_ci tables that depend on users
- It deletes/nullifies rows in all the gitlab_ci tables that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2
I will need to test again to make sure. Maybe I can write some test cases for it as well.
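A quick way to re-check the per-status counts on each database (a sketch; it assumes status = 1 is pending and status = 2 is processed, as above):

# Sketch: count deleted records per status on each connection.
[ApplicationRecord.connection, Ci::ApplicationRecord.connection].each do |conn|
  Gitlab::Database::SharedModel.using_connection(conn) do
    pp LooseForeignKeys::DeletedRecord.group(:status).count
  end
end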
What I found is that all the records in loose_foreign_keys_deleted_records in both databases are marked as processed (`status = 2`), regardless of whether the parent table belongs to that database or not.
AFAIK this shouldn't work like that. Before switching to the decomposed version, check if you have some records with status=1. Once the decomposed system boots up, we should see that the worker won't scan CI LFK definitions when processing main. For example, a deleted record with table name ci_builds should stay unprocessed in the deleted records table.
before switching to the decomposed version, check if you have some records with status=1
Yes I did. Before I started the experiment, I cleaned up the table on the single database.
And before I ran the cleanup job, I made sure that I had N items in loose_foreign_keys_deleted_records on both databases, all of them with status=1.
connection = ApplicationRecord.connection; nil

Gitlab::Database::SharedModel.using_connection(connection) do
  LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
end

## The previous code was updating only gitlab_main tables

connection = Ci::ApplicationRecord.connection; nil

Gitlab::Database::SharedModel.using_connection(connection) do
  LooseForeignKeys::ProcessDeletedRecordsService.new(connection: connection).execute
end
This means that on PRD we're wasting some cycles checking for deleted records for tables which are never going to have records. I guess this can be a problem if we do the decomposition and install triggers: after that step the table cannot get any deletes.
The fix would be to select only the tables that belong to the schema of the current connection.
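A rough sketch of that filtering (the helper names are assumptions, not a final implementation):

# Hypothetical sketch: only process parent tables whose gitlab_schema is
# served by the connection currently being cleaned up.
schemas = Gitlab::Database.gitlab_schemas_for_connection(connection)

tables_to_process = Gitlab::Database::LooseForeignKeys.definitions_by_table.keys.select do |table|
  schemas.include?(Gitlab::Database::GitlabSchema.table_schema(table))
end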
@ahegyi after testing it further, I found out that my earlier list didn't cover everything that happened. Here is the full list of what happens:
When the worker runs on main or on ci, the exact same work happens, as long as the items in loose_foreign_keys_deleted_records are the same:

- It deletes/nullifies rows in all the gitlab_main tables on main that depend on users
- It deletes/nullifies rows in all the gitlab_main tables on main that depend on ci_pipelines
- It deletes/nullifies rows in all the gitlab_ci tables on ci that depend on users
- It deletes/nullifies rows in all the gitlab_ci tables on ci that depend on ci_pipelines
- It marks all the loose_foreign_keys_deleted_records with status = 2 on the database it is running against
Basically, when we decompose the database and start the workers again, we duplicate the amount of work. That is totally fine, considering that there are probably not many items, if there are any at all, because we will most probably tell customers to stop their web services and drain their Sidekiq workers.
My only concern was that we might run DELETE or UPDATE for a table on the wrong database and end up with a lock-writes error, because when the database is decomposed we lock writes on the tables that are no longer relevant. But we always choose the connection from the model, regardless of which database the SharedModel is connected to. So everything is perfect.
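In other words (a simplified illustration):

# The DELETE/UPDATE goes through the model class, so it uses the model's own
# database regardless of the connection SharedModel is currently pinned to.
Gitlab::Database::SharedModel.using_connection(ApplicationRecord.connection) do
  Ci::Build.connection_db_config.name # => "ci", not "main"
end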
For batched background migrations I think we should document that all migrations must be completed before doing the migration to two databases. If that's the case, the migration tables will only contain data related to completed migrations, which might narrow the scope of our investigation.
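Something like this could serve as a pre-flight check (the numeric value for the finished status is an assumption here; verify it against the model first):

# Sketch: expect no batched background migrations in a non-finished state
# before decomposing. 3 is assumed to be the "finished" status.
Gitlab::Database::BackgroundMigration::BatchedMigration.where.not(status: 3).none?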
Adam Hegyi marked the checklist item db/docs/loose_foreign_keys_deleted_records.yml:gitlab_schema: gitlab_shared as completed
Adam Hegyi marked the checklist item db/docs/loose_foreign_keys_deleted_records.yml:gitlab_schema: gitlab_shared as incomplete