Use database SCHEMAs to better isolated CI decomposed features

Once we start decomposing features, it becomes a problem to clearly define boundaries between features to prevent code mistakes doing operations forbidden when many databases are used. This creates chaos what can be freed/removed from main database, or what foreign keys are allowed to be used.

Proposal

Usage of SCHEMA is convenient way to create logical split of tables within a single logical database. For that particular purpose we should use SCHEMA to make it explicit that tables in this SCHEMA are a separate logical structure that is migrated into another database.

Benefits of using SCHEMA:

clearly indicate decomposed features with a SCHEMA
detect foreign keys cross-joining SCHEMA
limit visibility of schema depending on connection used
safety: allow to run production in a permissive mode (seeing all schemas) to retain things just work
safety: allow to run development/test in an enforcing mode (limiting schema visibility) to discover cross-joining features and forbid them as part of test run
be able to clearly indicate data that is shared (tables required to be present in all databases) vs data owned by the main database exclusively that can be truncated/dropped in another context

Iteration plan

This plans is defining a number of incremental steps based on assumption that we are using untested feature of Rails (schema_search_path) in a highly optimised environment (PgBouncer and Load Balancer) and there's a high risk of causing a production incident.

Development: Create migration create_schema(:gitlab_ci) (There's a Rails helper in ActiveRecord::Migration)
Development: Update config/database.yml to define main: schema_search_path: public,gitlab_ci for main:. This needs to be populated in GDK/GCK/rspec/Omnibus/CNG.
Production: Validate that a proper search_path is used on production by doing ActiveRecord::Base.connection.schema_search_path.
Development: Ship migration to do ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci.
Development: Model a way to indicate in a code a way to what model schema belongs.
Development: Validate that ci_instance_variables is part of structure.sql.
Development: Validate that ci_instance_variables is part of backup/restore (this feature defines exportable schemas) and works with project import/export.
Development: Build a failsafe FF in a codebase to make it possible to disable fetching of ci_instance_variables (or make it possible to access it via explicitly defined name that includes schema gitlab_ci.ci_instance_variables) as we expect that this might break and in some cases schema visibility might be invalid.
Production: Rollout feature flag and validate that instance variables can be accessed.
Development: Ship a rspec code that validates that we do not define FKs that cross-join schemas.
Development: Ship a code that discovers and disallows schema cross-joins (likely not possible yet).
Once confirmed, this is done.

Example way to indicate schemas

# config/database.yml
production:
  main:
    schema_search_path: public,gitlab_ci

# migration
def up
  create_schema(:gitlab_ci)
  execute("ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci")
end

# Describe schema
# As we can later use that to validate cross-schema joins
class ApplicationRecord
  class_attribute :schema_name, default: :public
end

class Ci::BaseModel < ApplicationRecord
  self.schema_name = :gitlab_ci
end

# Dynamically alter schema visibility
module DatabaseSchemaHelper
  def with_search_paths(connection, search_path)
    # we need transaction to ensure that changes are local
    # this is required by PgBouncer which might change connections on each subsequent call
    connection.transaction do
      saved = connection.schema_search_path
      connection.schema_search_path = search_path
      yield
    end
  ensure
    connection.schema_search_path = saved
  end
end

Risks

There are risks associated with usage of schemas:

Our customers might be not allowing to create/manage schemas, which will result in failed migrations
Dynamically changing schema outside of transaction might change a global context, resulting in an invalid visibility for subsequent requests
We might find that schema visibility is not always set if used with PgBouncer, or our Load Balancer implementation
We might find that additional schemas are not properly exported as part of backups, or as part of structure.sql
Validate if PgBouncer managed connections can end-up in a state where invalid schema is used: #333415 (comment 616830582).
Customers using a different name for main schema, that is not public

Ideally

We would create and set visibility accordingly.

gitlab_shared - all tables that needs to live in every logical database, like: schema_migrations and *background_migrations
gitlab_main - all tables related to main database
gitlab_ci - all tables related to ci database

Shared tables

Migrations tables

There are number of tables that are required to be placed in both places (aka. gitlab_shared), and are required for each newly configured database:

schema_migrations required by Rails to know what migrations were executed
batched_background_migration_jobs, background_migration_jobs required by GitLab Rails to implement a method to have a predictable long running migrations

It is without a question that schema_migrations needs to be local to the database.

It is to question if background_migrations* does need to be. However, the good practice rather says that it should, since BG migrations is affecting always a specific tables for which it is to run. Keeping these tables local to what is to be modified is likely a desirable architecture. It would create a chaos to keep BG migrations only in central database

Partitioning schemas

We use a separate schema for hiding partitioned tables gitlab_partitions_dynamic and gitlab_partitions_static. This serves like a bucket for all partitioned tables.

Likely this model should be change into being explicit about decomposed partition, like:

gitlab_main_partitions_dynamic and gitlab_main_partitions_static
gitlab_ci_partitions_dynamic and gitlab_ci_partitions_static

Edited Jul 02, 2021 by Kamil Trzciński