Skip to content

Support database migrations via `db/migrate` to be properly executed in a relevant context

The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.

This is follow-up after: #339902 (closed).

Related to: #341783 (closed).

Problem

  1. Today all db/migrate changes (DDL, selects, and DML) are executed on top of all databases (main and CI)
  2. After decomposed the data will be split into two data stores (main and CI) limiting what migrations can do in a given context
  3. Since the migrations code to execute (db/migrate) is being shared, but data is split (decomposed database) we need to be clever what migrations are executed in what context

Proposal

1. Design proposal

  1. We made a design decision some time ago to share db/structure.sql across all databases, so all DDL changes needs to be replicated
  2. We made a decision to not share data (so tables of CI will be truncated in main context and vice-versa) so we should not run any select/DML changes on top of other database
  3. The db/migrate can only talk to the current database. It is hard/not-desired to allow db/migrate to cross-talk between database so it only can mutate data of an existing context (main or CI)
  4. Cross-talk between databases can only be implemented via background migrations which can be scheduled from db/migrate
  5. Migration needs to be of a single purpose. It needs to modify DDL or select/DML never both

2. Implementation proposal

  1. Following the #339902 (comment 690167297). The change is to move away from gitlab_migration into a simpler concept of restrict_gitlab_migration. We need to only limit migration to a given context, since all DDL changes are replicated everywhere.
  2. Describe db/migrate and db/post_migrate with the purpose of migration in a form of restrict_gitlab_migration schema: :gitlab_main, environment: ...
  3. Implement an application-level validation using pg_query similar to how we do it with cross-joins to validate the purpose of the query made by the migration. Check if it adheres to gitlab_schemas of config/database.yml
  4. Properly configure config/database.yml for a decomposed DB with a `database_tasks:
  5. The behavior of restrict_gitlab_migration is to skip migrations unless matching conditions
  6. All select/DML changes require the presence of restrict_gitlab_migration as part of the Gitlab::Database::Migration[2.0]

3. Validation of migrations

  1. The restrict_gitlab_migration serves a purpose to declare what a given migration does and is meant to implement a way to skip migrations unless matching conditions
  2. The restrict_gitlab_migration on its own does not validate the migration
  3. The actual validation of migration is done as part of running decomposed CI testing and replaying latest migrations
  4. The decomposed CI testing defines on each connection what tables can be read or written
  5. The migration reading/modifying data on a given connection outside of defined gitlab_schemas will be disallowed, which will raise an exception and will fail CI
  6. The CI job validates that db/structure.sql produced by main and CI is exactly the same

4. Used decomposed database config in CI for testing migrations

Example migrations

Simple DDL migration being executed in all cases since we reply all structure.sql (allowed)

For majority of cases migrations does not need to be changed since the majority of our migrations perform only DDL changes (add/remove columns, add/remove foreign keys, create new indexes).

class AddReportTypeIndexIntoApprovalProjectRules < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  INDEX_NAME = 'index_approval_project_rules_report_type'

  def up
    add_concurrent_index :approval_project_rules, :report_type, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :approval_project_rules, name: INDEX_NAME
  end
end

Simple DML migration (allowed)

Since this migration mutates data of file_registry it can only be executed in gitlab_main.

class FixStateColumnInFileRegistry < Gitlab::Database::Migration[2.0]
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  disable_ddl_transaction!

  # The following cop is disabled because of https://gitlab.com/gitlab-org/gitlab/issues/33470
  # rubocop:disable Migration/UpdateColumnInBatches
  def up
    update_column_in_batches(:file_registry, :state, 2) do |table, query|
      query.where(table[:success].eq(true)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end
  # rubocop:enable Migration/UpdateColumnInBatches

  def down
    # no-op
  end
end

Complex DDL and DML change (forbidden)

Such migrations are forbidden as they modify DDL and DML in a single execution. This means that we are unable to apply DDL changes on all databases, and DML changes only on main database.

class DeleteLfsObjectsFromFileRegistry < ActiveRecord::Migration[6.0]
  DOWNTIME = false

  def up
    execute("DELETE FROM file_registry WHERE file_type = 'lfs'")
    execute('DROP TRIGGER IF EXISTS replicate_lfs_object_registry ON file_registry')
    execute('DROP FUNCTION IF EXISTS replicate_lfs_object_registry()')
  end

  def down
    # no-op
  end
end

Schedule of background migrations (allowed)

This type of migration is a select-type it means that it requires source data. This requires that such migration needs to be only executed in a single context (main or ci).

class PopulateCanonicalEmails < Gitlab::Database::Migration[2.0]
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  disable_ddl_transaction!

  class User < ActiveRecord::Base
    include EachBatch

    self.table_name = 'users'

    scope :with_gmail, -> { select(:id, :email).where("email ILIKE '%gmail.com'") }
  end

  # Limited to *@gmail.com addresses only as a first iteration, because we know
  # Gmail ignores `.` appearing in the Agent name, as well as anything after `+`

  def up
    # batch size is the default, 1000
    migration = Gitlab::BackgroundMigration::PopulateCanonicalEmails
    migration_name = migration.to_s.demodulize

    queue_background_migration_jobs_by_range_at_intervals(
      User.with_gmail,
      migration_name,
      1.minute)
  end

  def down
    # no-op
  end
end

Usage of ActiveRecord models for accessing data (allowed)

All models nested into migration needs to be based as ActiveRecord::Base regardless on what database the data is located (main or CI). This works out of box since Rails when running migrations in context of main or ci does switch a connection on ActiveRecord::Base. Additionally depending on where the given migration is executed the ActiveRecord::Base.connection_db_config.name == 'main' or 'ci'.

This already is coherent with our todays guidelines.

class ConfirmSecurityBot < Gitlab::Database::Migration[2.0]
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  class User < ActiveRecord::Base
    self.table_name = 'users'
    SECURITY_BOT_TYPE = 8
  end

  def up
    User.where(user_type: User::SECURITY_BOT_TYPE, confirmed_at: nil)
      .update_all(confirmed_at: Time.current)
  end

  # no-op
  # Security Bot should be always confirmed
  def down
  end
end

Iterations

  1. Implement a way to validate if connection is being used in a correct way: a right tables are being accessed only: !71540 (closed)
  2. Implement restrict_* helpers
  3. Document restrict_* helpers and a way to use them
  4. Implement and run decomposed migrations that use schema usage validation and restrict helpers to ensure that migrations are properly written
  5. (Optional) Make the gitlab_schema validation as part of production code to see if connections are used in a right way

This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.

Edited by Kamil Trzciński