Decide on a single `structure.sql` vs `ci_structure.sql`

One of the problems raised as part of the #334690 (closed) was the concern regarding usage of a single structure.sql vs many *structure.sql.

Methods

Lets consider the following database config after the #333415 (closed), and #335076 (closed). This is expected end-goal type of config:

main:
  database: gitlab_main
  schema_search_paths: gitlab_main,gitlab_shared
ci:
  database: gitlab_ci
  schema_search_paths: gitlab_ci,gitlab_shared

Remark: Today, everything by default uses public schema. This changes the public to become gitlab_main as an indication that most features use main database. This is discussed in more detail as part of #333415 (closed).

1. Single `structure.sql`

The single structure.sql defines all tables and is populated on many databases, making us to duplicate the same structure in many places.

main:
  database: gitlab_main
  migration_paths: [db/migrate, db/post_migrate]
  schema_search_paths: gitlab_main,gitlab_shared
ci:
  database: gitlab_ci
  migration_paths: [db/migrate, db/post_migrate]
  schema_search_paths: gitlab_ci,gitlab_shared

The structure.sql represents a definition of the whole logical database.

This model:

Provides a bloated structure.sql that is replicated on many databases
This implies that a shared db/migrate and db/post_migrate needs to be used to replicate all migrations on both databases
There is a risk that we might run migrations in context of another database, with different schema visibilty where it can result in running some migrations many times
Example to consider is the migration procedure: we migrate data of gitlab_ci schema into separate database server, how we will run migration on gitlab_main for the gitlab_ci? Will we skip them or run all of them replicating their behavior
Likely we need to introduce a schema matcher as part of requires_schema to check this against our schema visibility and only run it there
Likely this requires (due to how rails manages structure.sql) to create a symlink between ln -s structure.sql ci_structure.sql to ensure that the same schema is used
Likely this requires modifying Rails from emitting ci_structure.sql

# structure.sql
CREATE SCHEMA gitlab_main;
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_main.projects;
CREATE_TABLE gitlab_ci.ci_builds;

class MyMigration < ActiveRecord::Migration
  requires_schema :gitlab_main
  # requires_schema :gitlab_ci

  def up
    #
  end
end

2. Many `*structure.sql`

In this model we have many *structure.sql, dedicated for each logical database:

db/structure.sql defines a structure for main database
db/ci_structure.sql defines a structure for ci database

The *structure.sql represents a definition of the whole logical database.

# structure.sql
CREATE SCHEMA gitlab_main;
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_main.projects;
CREATE_TABLE gitlab_ci.ci_builds;

# ci_structure.sql
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_ci.ci_builds;

This implies that each structure has its own set of migrations. There might be different models how migrations can be executed, depending on iterations:

Model 1: distinct folders

main:
  database: gitlab_main
  migration_paths: [db/migrate, db/post_migrate, db/ci_migrate, db/ci_post_migrate]
  schema_search_paths: gitlab_main,gitlab_shared
ci:
  database: gitlab_ci
  migration_paths: [db/ci_migrate, db/ci_post_migrate]
  schema_search_paths: gitlab_ci,gitlab_shared

db/migrate, db/post_migrate - migrations for db/structure.sql
db/ci_migrate, db/ci_post_migrate - migrations for db/ci_structure.sql

This requires that each migration concerning CI needs to be copied into db/migrate as we don't perform automated reply of them on top of db/structure.sql.

This models:

Provides a bloated structure.sql that is used on main database, and clean ci_structure.sql
Requires that all migrations of CI are run as well on main
Requires a way to discover what schema is being modified (likely, by convention, files put into db/ci_migrate modify gitlab_ci schema)
There is a risk that we might run migrations in context of another database, with different schema visibility where it can result in running some migrations many times
Example to consider is the migration procedure: we migrate data of gitlab_ci schema into separate database server, how we will run migration on gitlab_main for the gitlab_ci? Will we skip them or run all of them replicating their behavior
We need a way to ensure consistency between part of the structure.sql and ci_structure.sql (a special pg_dump to validate that subset of schema is exactly the same)?

Model 2: inherited folder

main:
  database: gitlab_main
  migration_paths: [db/migrate, db/post_migrate]
  schema_search_paths: gitlab_main,gitlab_shared
ci:
  database: gitlab_ci
  migration_paths: [db/migrate/ci, db/post_migrate/ci]
  schema_search_paths: gitlab_ci,gitlab_shared

db/migrate, db/post_migrate - migrations for db/structure.sql, which will additionally execute all nested migrations from /ci
db/migrate/ci, db/post_migrate/ci - migrations for db/ci_structure.sql

This requires that each migration concerning the CI needs to be in db/migrate/ci.

This models:

Provides a bloated structure.sql that is used on main database, and clean ci_structure.sql
Requires that all migrations of CI are run as well on main
Requires a way to discover what schema is being modified (likely, by convention, files put into db/migrate/<schema_name> modify gitlab_ci schema)
There is a risk that we might run migrations in context of another database, with different schema visibility where it can result in running some migrations many times
Example to consider is the migration procedure: we migrate data of gitlab_ci schema into separate database server, how we will run migration on gitlab_main for the gitlab_ci? Will we skip them or run all of them replicating their behavior
We need a way to ensure consistency between part of the structure.sql and ci_structure.sql (a special pg_dump to validate that subset of schema is exactly the same)?

3. Many `*structure.sql` schema focused

This is likely our end-goal. Have clean *structure.sql containing only definitions related to processed data.

In this model we have many *structure.sql, dedicated for each set of schema visibility:

db/structure.sql defines a structure for public (or proposed gitlab_main) schema, used with main database
db/ci_structure.sql defines a structure for gitlab_ci schema, used with CI database

main:
  database: gitlab_main
  migration_paths: [db/migrate, db/post_migrate]
  schema_search_paths: gitlab_main,gitlab_shared
ci:
  database: gitlab_ci
  migration_paths: [db/ci_migrate, db/ci_post_migrate]
  schema_search_paths: gitlab_ci,gitlab_shared

This approach moves away from having a structure.sql defining a whole logical database, instead moves into schema management. This is an approach that is natively supported by Rails with config.active_record.dump_schemas = :schema_search_path, which allows to clearly define which schemas are to be dumped.

This approach implies that many configurations for many connections are used, allowing to point to the same logical database, but managing only subset of the database with schema.

This models:

Requires that we do not use a single connection for many schemas as described as 1. A single monolithic database (not migrated), with a single connection. This will not work, since we will not have a single big structure.sql. This might be a blocker for some installations, but as well this is a great end-goal to have.
All migrations are run within a specific connection, and specific defined visibility
Schema search path defines a visibility and we don't need to add additional semantic for migrations
Since we do not run migrations many times, we guarantee that migrations process only what is needed
We don't need to maintain consistency, since we guarantee that these are self-sufficient structures
In a transition period the main: { schema_search_path: gitlab_main,gitlab_ci } can be used to avoid bugs related to cross-joins
Provides a truly clean *structure.sql without any unrelated schemas

# structure.sql
CREATE SCHEMA gitlab_main;
CREATE_TABLE gitlab_main.projects;

# ci_structure.sql
CREATE SCHEMA gitlab_ci;
CREATE_TABLE gitlab_ci.ci_builds;

4. Problems to solve

This defines a set of problems that we want to solve, these are not tied to particular method, but rather a set of a general design guidelines which are not ordered by the importance:

a clean/tidy structure.sql containing only relevant tables required by a given database
easy and predictable behavior of migrations without duplication or overlap
predictable behavior and support for non-migrated customers: customers using a single monolithic database
predictable behavior for partially migrated customers: ex.: data got copied, many databases are used, but old database still contains all CI data as it is not yet cleaned-up
easy way to develop new migrations without hacks/extra checks about execution context (ex. of anti-pattern is Gitlab.com?)
coherency of structure between databases that is automatically validated
predictable behavior and tracking of schema migrations and background migrations
we need a way to model and maintain shared tables: #333415 (closed)

Edited Jul 02, 2021 by Kamil Trzciński

Decide on a single `structure.sql` vs `ci_structure.sql`

Methods

1. Single structure.sql

2. Many *structure.sql

Model 1: distinct folders

Model 2: inherited folder

3. Many *structure.sql schema focused

4. Problems to solve

1. Single `structure.sql`

2. Many `*structure.sql`

3. Many `*structure.sql` schema focused