Decide on a single `structure.sql` vs `ci_structure.sql`
One of the problems raised as part of the #334690 (closed)
was the concern regarding usage of a single structure.sql
vs many *structure.sql
.
Methods
Lets consider the following database config after the #333415 (closed), and #335076 (closed). This is expected end-goal type of config:
main:
database: gitlab_main
schema_search_paths: gitlab_main,gitlab_shared
ci:
database: gitlab_ci
schema_search_paths: gitlab_ci,gitlab_shared
Remark: Today, everything by default uses
public schema
. This changes thepublic
to becomegitlab_main
as an indication that most features usemain
database. This is discussed in more detail as part of #333415 (closed).
structure.sql
1. Single The single structure.sql
defines all tables and is populated on many databases,
making us to duplicate the same structure in many places.
main:
database: gitlab_main
migration_paths: [db/migrate, db/post_migrate]
schema_search_paths: gitlab_main,gitlab_shared
ci:
database: gitlab_ci
migration_paths: [db/migrate, db/post_migrate]
schema_search_paths: gitlab_ci,gitlab_shared
The structure.sql
represents a definition of the whole logical database.
This model:
- Provides a bloated
structure.sql
that is replicated on many databases - This implies that a shared
db/migrate
anddb/post_migrate
needs to be used to replicate all migrations on both databases - There is a risk that we might run migrations in context of another database, with different schema visibilty where it can result in running some migrations many times
- Example to consider is the migration procedure: we migrate
data of gitlab_ci schema
into separate database server, how we will run migration ongitlab_main
for thegitlab_ci
? Will we skip them or run all of them replicating their behavior - Likely we need to introduce a schema matcher as part of
requires_schema
to check this against ourschema visibility
and only run it there - Likely this requires (due to how rails manages
structure.sql
) to create a symlink betweenln -s structure.sql ci_structure.sql
to ensure that the same schema is used - Likely this requires modifying Rails from emitting
ci_structure.sql
# structure.sql
CREATE SCHEMA gitlab_main;
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_main.projects;
CREATE_TABLE gitlab_ci.ci_builds;
class MyMigration < ActiveRecord::Migration
requires_schema :gitlab_main
# requires_schema :gitlab_ci
def up
#
end
end
*structure.sql
2. Many In this model we have many *structure.sql
, dedicated for each logical database:
-
db/structure.sql
defines a structure formain
database -
db/ci_structure.sql
defines a structure forci
database
The *structure.sql
represents a definition of the whole logical database.
# structure.sql
CREATE SCHEMA gitlab_main;
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_main.projects;
CREATE_TABLE gitlab_ci.ci_builds;
# ci_structure.sql
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_ci.ci_builds;
This implies that each structure has its own set of migrations. There might be different models how migrations can be executed, depending on iterations:
Model 1: distinct folders
main:
database: gitlab_main
migration_paths: [db/migrate, db/post_migrate, db/ci_migrate, db/ci_post_migrate]
schema_search_paths: gitlab_main,gitlab_shared
ci:
database: gitlab_ci
migration_paths: [db/ci_migrate, db/ci_post_migrate]
schema_search_paths: gitlab_ci,gitlab_shared
-
db/migrate
,db/post_migrate
- migrations fordb/structure.sql
-
db/ci_migrate
,db/ci_post_migrate
- migrations fordb/ci_structure.sql
This requires that each migration concerning CI needs to be copied into db/migrate
as we don't perform automated reply of them on top of db/structure.sql
.
This models:
- Provides a bloated
structure.sql
that is used on main database, and cleanci_structure.sql
- Requires that all migrations of CI are run as well on
main
- Requires a way to discover what schema is being modified (likely, by convention, files put into
db/ci_migrate
modifygitlab_ci
schema) - There is a risk that we might run migrations in context of another database, with different schema visibility where it can result in running some migrations many times
- Example to consider is the migration procedure: we migrate
data of gitlab_ci schema
into separate database server, how we will run migration ongitlab_main
for thegitlab_ci
? Will we skip them or run all of them replicating their behavior - We need a way to ensure consistency between part of the
structure.sql
andci_structure.sql
(a specialpg_dump
to validate that subset of schema is exactly the same)?
Model 2: inherited folder
main:
database: gitlab_main
migration_paths: [db/migrate, db/post_migrate]
schema_search_paths: gitlab_main,gitlab_shared
ci:
database: gitlab_ci
migration_paths: [db/migrate/ci, db/post_migrate/ci]
schema_search_paths: gitlab_ci,gitlab_shared
-
db/migrate
,db/post_migrate
- migrations fordb/structure.sql
, which will additionally execute all nested migrations from/ci
-
db/migrate/ci
,db/post_migrate/ci
- migrations fordb/ci_structure.sql
This requires that each migration concerning the CI needs to be in db/migrate/ci
.
This models:
- Provides a bloated
structure.sql
that is used on main database, and cleanci_structure.sql
- Requires that all migrations of CI are run as well on
main
- Requires a way to discover what schema is being modified (likely, by convention, files put into
db/migrate/<schema_name>
modifygitlab_ci
schema) - There is a risk that we might run migrations in context of another database, with different schema visibility where it can result in running some migrations many times
- Example to consider is the migration procedure: we migrate
data of gitlab_ci schema
into separate database server, how we will run migration ongitlab_main
for thegitlab_ci
? Will we skip them or run all of them replicating their behavior - We need a way to ensure consistency between part of the
structure.sql
andci_structure.sql
(a specialpg_dump
to validate that subset of schema is exactly the same)?
*structure.sql
schema focused
3. Many This is likely our end-goal. Have clean *structure.sql
containing only definitions
related to processed data.
In this model we have many *structure.sql
, dedicated for each set of schema visibility:
-
db/structure.sql
defines a structure forpublic
(or proposedgitlab_main
) schema, used withmain
database -
db/ci_structure.sql
defines a structure forgitlab_ci
schema, used withCI
database
main:
database: gitlab_main
migration_paths: [db/migrate, db/post_migrate]
schema_search_paths: gitlab_main,gitlab_shared
ci:
database: gitlab_ci
migration_paths: [db/ci_migrate, db/ci_post_migrate]
schema_search_paths: gitlab_ci,gitlab_shared
This approach moves away from having a structure.sql defining a whole logical database,
instead moves into schema management. This is an approach that is natively supported by Rails
with config.active_record.dump_schemas = :schema_search_path
, which allows to clearly define
which schemas are to be dumped.
This approach implies that many configurations for many connections are used, allowing to point to the same logical database, but managing only subset of the database with schema.
This models:
- Requires that we do not use a single connection for many schemas as described as 1. A single monolithic database (not migrated), with a single connection. This will not work, since we will not have a single big
structure.sql
. This might be a blocker for some installations, but as well this is a great end-goal to have. - All migrations are run within a specific connection, and specific defined visibility
- Schema search path defines a visibility and we don't need to add additional semantic for migrations
- Since we do not run migrations many times, we guarantee that migrations process only what is needed
- We don't need to maintain consistency, since we guarantee that these are self-sufficient structures
- In a transition period the
main: { schema_search_path: gitlab_main,gitlab_ci }
can be used to avoid bugs related to cross-joins - Provides a truly clean
*structure.sql
without any unrelated schemas
# structure.sql
CREATE SCHEMA gitlab_main;
CREATE_TABLE gitlab_main.projects;
# ci_structure.sql
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_ci.ci_builds;
4. Problems to solve
This defines a set of problems that we want to solve, these are not tied to particular method, but rather a set of a general design guidelines which are not ordered by the importance:
- a clean/tidy
structure.sql
containing only relevant tables required by a given database - easy and predictable behavior of migrations without duplication or overlap
- predictable behavior and support for non-migrated customers: customers using a single monolithic database
- predictable behavior for partially migrated customers: ex.: data got copied, many databases are used, but old database still contains all CI data as it is not yet cleaned-up
- easy way to develop new migrations without hacks/extra checks about execution context (ex. of anti-pattern is
Gitlab.com?
) - coherency of structure between databases that is automatically validated
- predictable behavior and tracking of schema migrations and background migrations
- we need a way to model and maintain shared tables: #333415 (closed)