Use database SCHEMAs to better isolated CI decomposed features
Once we start decomposing features, it becomes a problem to clearly define boundaries between features to prevent code mistakes doing operations forbidden when many databases are used. This creates chaos what can be freed/removed from main database, or what foreign keys are allowed to be used.
Proposal
Usage of SCHEMA is convenient way to create logical split of tables within a single logical database.
For that particular purpose we should use SCHEMA to make it explicit that tables in this SCHEMA
are a separate logical structure that is migrated into another database.
Benefits of using SCHEMA:
- clearly indicate decomposed features with a
SCHEMA - detect foreign keys cross-joining
SCHEMA - limit visibility of schema depending on connection used
- safety: allow to run production in a permissive mode (seeing all schemas) to retain things just work
- safety: allow to run development/test in an enforcing mode (limiting schema visibility) to discover cross-joining features and forbid them as part of test run
- be able to clearly indicate data that is shared (tables required to be present in all databases) vs data owned by the
maindatabase exclusively that can be truncated/dropped in another context
Iteration plan
This plans is defining a number of incremental steps based on assumption that we are using untested
feature of Rails (schema_search_path) in a highly optimised environment (PgBouncer and Load Balancer)
and there's a high risk of causing a production incident.
- Development: Create migration
create_schema(:gitlab_ci)(There's a Rails helper inActiveRecord::Migration) - Development: Update
config/database.ymlto definemain: schema_search_path: public,gitlab_ciformain:. This needs to be populated in GDK/GCK/rspec/Omnibus/CNG. - Production: Validate that a proper
search_pathis used on production by doingActiveRecord::Base.connection.schema_search_path. - Development: Ship migration to do
ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci. - Development: Model a way to indicate in a code a way to what model schema belongs.
- Development: Validate that
ci_instance_variablesis part ofstructure.sql. - Development: Validate that
ci_instance_variablesis part ofbackup/restore(this feature defines exportable schemas) and works with projectimport/export. - Development: Build a failsafe FF in a codebase to make it possible to disable fetching of
ci_instance_variables(or make it possible to access it via explicitly defined name that includes schemagitlab_ci.ci_instance_variables) as we expect that this might break and in some cases schema visibility might be invalid. - Production: Rollout feature flag and validate that instance variables can be accessed.
- Development: Ship a rspec code that validates that we do not define FKs that cross-join schemas.
- Development: Ship a code that discovers and disallows schema cross-joins (likely not possible yet).
- Once confirmed, this is done.
Example way to indicate schemas
# config/database.yml
production:
main:
schema_search_path: public,gitlab_ci
# migration
def up
create_schema(:gitlab_ci)
execute("ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci")
end
# Describe schema
# As we can later use that to validate cross-schema joins
class ApplicationRecord
class_attribute :schema_name, default: :public
end
class Ci::BaseModel < ApplicationRecord
self.schema_name = :gitlab_ci
end
# Dynamically alter schema visibility
module DatabaseSchemaHelper
def with_search_paths(connection, search_path)
# we need transaction to ensure that changes are local
# this is required by PgBouncer which might change connections on each subsequent call
connection.transaction do
saved = connection.schema_search_path
connection.schema_search_path = search_path
yield
end
ensure
connection.schema_search_path = saved
end
end
Risks
There are risks associated with usage of schemas:
- Our customers might be not allowing to create/manage schemas, which will result in failed migrations
- Dynamically changing schema outside of transaction might change a global context, resulting in an invalid visibility for subsequent requests
- We might find that schema visibility is not always set if used with PgBouncer, or our Load Balancer implementation
- We might find that additional schemas are not properly exported as part of backups, or as part of
structure.sql - Validate if PgBouncer managed connections can end-up in a state where invalid schema is used: #333415 (comment 616830582).
- Customers using a different name for main schema, that is not
public
Ideally
We would create and set visibility accordingly.
-
gitlab_shared- all tables that needs to live in every logical database, like:schema_migrationsand*background_migrations -
gitlab_main- all tables related tomaindatabase -
gitlab_ci- all tables related tocidatabase
Shared tables
Migrations tables
There are number of tables that are required to be placed in both places (aka. gitlab_shared), and are required for each newly configured database:
-
schema_migrationsrequired by Rails to know what migrations were executed -
batched_background_migration_jobs,background_migration_jobsrequired by GitLab Rails to implement a method to have a predictable long running migrations
It is without a question that schema_migrations needs to be local to the database.
It is to question if background_migrations* does need to be. However, the good practice
rather says that it should, since BG migrations is affecting always a specific tables
for which it is to run. Keeping these tables local to what is to be modified is likely
a desirable architecture. It would create a chaos to keep BG migrations only in central
database
Partitioning schemas
We use a separate schema for hiding partitioned tables gitlab_partitions_dynamic and gitlab_partitions_static.
This serves like a bucket for all partitioned tables.
Likely this model should be change into being explicit about decomposed partition, like:
-
gitlab_main_partitions_dynamicandgitlab_main_partitions_static -
gitlab_ci_partitions_dynamicandgitlab_ci_partitions_static