
Adds How-to section to BBM docs

What does this MR do and why?

It adds a how-to section to BBM docs.

  • Re-organize some topics under this new section:
    • Generate a batched background migration
    • Enqueue a batched background migration
    • Use job arguments
    • Use filters
    • Access data for multiple databases
    • Re-queue batched background migrations
    • Batch over non-distinct columns
    • 🆕 Calculate overall time estimation of a batched background migration
    • Cleaning up a batched background migration
  • Adds a new section called Calculate overall time estimation of a batched background migration
Changes

How to

Generate a batched background migration

The custom generator batched_background_migration scaffolds the necessary files and accepts table_name, column_name, and feature_category as arguments. Usage:

bundle exec rails g batched_background_migration my_batched_migration --table_name=<table-name> --column_name=<column-name> --feature_category=<feature-category>

This command creates the following files:

  • db/post_migrate/20230214231008_queue_my_batched_migration.rb
  • spec/migrations/20230214231008_queue_my_batched_migration_spec.rb
  • lib/gitlab/background_migration/my_batched_migration.rb
  • spec/lib/gitlab/background_migration/my_batched_migration_spec.rb

Enqueue a batched background migration

Queue a batched background migration in a post-deployment migration. Use this queue_batched_background_migration example to queue the migration for execution in batches. Replace the class name and arguments with the values from your migration:

queue_batched_background_migration(
  JOB_CLASS_NAME,
  TABLE_NAME,
  JOB_ARGUMENTS,
  JOB_INTERVAL
)

NOTE: This helper raises an error if the number of provided job arguments does not match the number of job arguments defined in JOB_CLASS_NAME.

Make sure the newly-created data is either migrated, or saved in both the old and new version upon creation. Removals in turn can be handled by defining foreign keys with cascading deletes.

Use job arguments

BatchedMigrationJob provides the job_arguments helper method for job classes to define the job arguments they need.

Batched migrations scheduled with queue_batched_background_migration must use the helper to define the job arguments:

queue_batched_background_migration(
  'CopyColumnUsingBackgroundMigrationJob',
  TABLE_NAME,
  'name', 'name_convert_to_text',
  job_interval: DELAY_INTERVAL
)

NOTE: If the number of defined job arguments does not match the number of job arguments provided when scheduling the migration, queue_batched_background_migration raises an error.

In this example, copy_from returns name, and copy_to returns name_convert_to_text:

class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
  job_arguments :copy_from, :copy_to
  operation_name :update_all

  def perform
    from_column = connection.quote_column_name(copy_from)
    to_column = connection.quote_column_name(copy_to)

    assignment_clause = "#{to_column} = #{from_column}"

    each_sub_batch do |relation|
      relation.update_all(assignment_clause)
    end
  end
end
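
The relationship between job_arguments and the readers it defines can be pictured with a small, self-contained sketch. SketchJob and CopyColumnSketch are hypothetical stand-ins and not GitLab's implementation. For simplicity, the arity check sits in the constructor here, whereas the text above describes queue_batched_background_migration raising the error at scheduling time.

```ruby
# Hypothetical stand-in for BatchedMigrationJob; the real class does much more.
class SketchJob
  # Class-level macro: records the argument names and defines a reader for each.
  def self.job_arguments(*names)
    @job_argument_names = names
    names.each_with_index do |name, index|
      define_method(name) { @job_arguments[index] }
    end
  end

  def self.job_argument_names
    @job_argument_names || []
  end

  def initialize(job_arguments: [])
    expected = self.class.job_argument_names.size
    unless job_arguments.size == expected
      raise ArgumentError, "expected #{expected} job arguments, got #{job_arguments.size}"
    end

    @job_arguments = job_arguments
  end
end

class CopyColumnSketch < SketchJob
  job_arguments :copy_from, :copy_to
end

job = CopyColumnSketch.new(job_arguments: ['name', 'name_convert_to_text'])
puts job.copy_from # name
puts job.copy_to   # name_convert_to_text
```

Passing one argument instead of two, as in CopyColumnSketch.new(job_arguments: ['name']), raises ArgumentError in this sketch.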

Use filters

By default, when creating background jobs to perform the migration, batched background migrations iterate over the full specified table. This iteration is done using the PrimaryKeyBatchingStrategy. If the table has 1000 records and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes, EachBatch is used like this:

# PrimaryKeyBatchingStrategy
Namespace.each_batch(of: 100) do |relation|
  relation.where(type: nil).update_all(type: 'User') # this happens in each background job
end

In some cases, only a subset of records must be examined. If only 10% of the 1000 records need examination, apply a filter to the initial relation when the jobs are created:

Namespace.where(type: nil).each_batch(of: 100) do |relation|
  relation.update_all(type: 'User')
end

In the first example, we don't know how many records will be updated in each batch. In the second (filtered) example, we know exactly 100 will be updated with each batch.

BatchedMigrationJob provides a scope_to helper method to apply additional filters and achieve this:

  1. Create a new migration job class that inherits from BatchedMigrationJob and defines the additional filter:

    class BackfillNamespaceType < BatchedMigrationJob
      scope_to ->(relation) { relation.where(type: nil) }
      operation_name :update_all
      feature_category :source_code_management
    
      def perform
        each_sub_batch do |sub_batch|
          sub_batch.update_all(type: 'User')
        end
      end
    end

    NOTE: For EE migrations that define scope_to, ensure the module extends ActiveSupport::Concern. Otherwise, records are processed without taking the scope into consideration.

  2. In the post-deployment migration, enqueue the batched background migration:

    class BackfillNamespaceType < Gitlab::Database::Migration[2.1]
      MIGRATION = 'BackfillNamespaceType'
      DELAY_INTERVAL = 2.minutes
    
      restrict_gitlab_migration gitlab_schema: :gitlab_main
    
      def up
        queue_batched_background_migration(
          MIGRATION,
          :namespaces,
          :id,
          job_interval: DELAY_INTERVAL
        )
      end
    
      def down
        delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
      end
    end

NOTE: When applying additional filters, ensure they are properly covered by an index to optimize EachBatch performance. In the example above, an index on (type, id) is needed to support the filters. See the EachBatch documentation for more information.
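
The difference between filtering inside each batch and filtering before batching can be simulated in plain Ruby. This is only an illustration with in-memory data: Row, ROWS, and both helper methods are hypothetical, and the rows with a nil type are evenly spaced on purpose, which real data usually is not.

```ruby
Row = Struct.new(:id, :type)

# 1000 in-memory rows stand in for the table; every 10th row has type nil.
ROWS = (1..1000).map { |id| Row.new(id, (id % 10).zero? ? nil : 'Group') }

# Unfiltered: batch over every row, then filter inside each batch.
def unfiltered_batch_counts(rows, batch_size)
  rows.each_slice(batch_size).map { |batch| batch.count { |row| row.type.nil? } }
end

# Filtered: apply the filter first, then batch over what is left.
def filtered_batch_counts(rows, batch_size)
  rows.select { |row| row.type.nil? }.each_slice(batch_size).map(&:size)
end

puts unfiltered_batch_counts(ROWS, 100).inspect # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
puts filtered_batch_counts(ROWS, 100).inspect   # [100]
```

The unfiltered run needs ten batches that each touch only a few rows, while the filtered run does the same work in a single full batch.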

Access data for multiple databases

Unlike regular migrations, background migrations have access to multiple databases and can be used to efficiently access and update data across them. To indicate which database to use, create an ActiveRecord model inline in the migration code. The model should inherit from the correct ApplicationRecord for the database where the table is located. Using ActiveRecord::Base is therefore disallowed, because it does not explicitly describe which database to use to access a given table.

# good
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ::ApplicationRecord
    self.table_name = 'projects'
  end

  class Build < ::Ci::ApplicationRecord
    self.table_name = 'ci_builds'
  end
end

# bad
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ActiveRecord::Base
    self.table_name = 'projects'
  end

  class Build < ActiveRecord::Base
    self.table_name = 'ci_builds'
  end
end

Similarly, using ActiveRecord::Base.connection is disallowed. Replace it, preferably with the connection of the model.

# good
Project.connection.execute("SELECT * FROM projects")

# acceptable
ApplicationRecord.connection.execute("SELECT * FROM projects")

# bad
ActiveRecord::Base.connection.execute("SELECT * FROM projects")

Re-queue batched background migrations

If one of the batched background migrations contains a bug that is fixed in a patch release, you must requeue the batched background migration so the migration repeats on systems that already performed the initial migration.

When you requeue the batched background migration, turn the original queuing into a no-op by clearing the #up and #down methods of the migration that performed the original queuing. Otherwise, the batched background migration is queued multiple times on systems that are upgrading multiple patch releases at once.

At the start of the second post-deployment migration, delete the previously queued batched migration with this code:

delete_batched_background_migration(MIGRATION_NAME, TABLE_NAME, COLUMN, JOB_ARGUMENTS)
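
As a rough sketch of the order of operations, the second post-deployment migration could look like the following. StubMigration is a hypothetical stand-in for Gitlab::Database::Migration[2.1] that only records helper calls so the example runs outside GitLab. The class name RequeueBackfillNamespaceType is also an assumption, and the table, column, and job name are borrowed from the filter example earlier on this page.

```ruby
# Stand-in for Gitlab::Database::Migration[2.1] so the sketch runs outside GitLab.
# It only records which helpers were called, and in what order.
class StubMigration
  attr_reader :calls

  def initialize
    @calls = []
  end

  def delete_batched_background_migration(*args)
    @calls << [:delete, *args]
  end

  def queue_batched_background_migration(*args, **options)
    @calls << [:queue, *args, options]
  end
end

# Hypothetical second post-deployment migration that requeues the fixed job.
class RequeueBackfillNamespaceType < StubMigration
  MIGRATION = 'BackfillNamespaceType'
  DELAY_INTERVAL = 120 # seconds; 2.minutes in a Rails environment

  def up
    # Remove the record of the earlier run first, so the job is queued only once.
    delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
    queue_batched_background_migration(MIGRATION, :namespaces, :id, job_interval: DELAY_INTERVAL)
  end

  def down
    delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
  end
end

migration = RequeueBackfillNamespaceType.new
migration.up
puts migration.calls.map(&:first).inspect # [:delete, :queue]
```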

Batch over non-distinct columns

The default batching strategy provides an efficient way to iterate over primary key columns. However, if you need to iterate over columns where values are not unique, you must use a different batching strategy.

The LooseIndexScanBatchingStrategy batching strategy uses a special version of EachBatch to provide efficient and stable iteration over the distinct column values.

This example shows a batched background migration where the issues.project_id column is used as the batching column.

Database post-migration:

class ProjectsWithIssuesMigration < Gitlab::Database::Migration[2.1]
  MIGRATION = 'BatchProjectsWithIssues'
  INTERVAL = 2.minutes
  BATCH_SIZE = 5000
  SUB_BATCH_SIZE = 500

  restrict_gitlab_migration gitlab_schema: :gitlab_main
  disable_ddl_transaction!

  def up
    queue_batched_background_migration(
      MIGRATION,
      :issues,
      :project_id,
      job_interval: INTERVAL,
      batch_size: BATCH_SIZE,
      batch_class_name: 'LooseIndexScanBatchingStrategy', # Override the default batching strategy
      sub_batch_size: SUB_BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :issues, :project_id, [])
  end
end

Implementing the background migration class:

module Gitlab
  module BackgroundMigration
    class BatchProjectsWithIssues < Gitlab::BackgroundMigration::BatchedMigrationJob
      include Gitlab::Database::DynamicModelHelpers

      operation_name :backfill_issues

      def perform
        distinct_each_batch do |batch|
          project_ids = batch.pluck(batch_column)
          # do something with the distinct project_ids
        end
      end
    end
  end
end

NOTE: Additional filters defined with scope_to are ignored by LooseIndexScanBatchingStrategy and distinct_each_batch.
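
The idea of batching over distinct values of a non-unique column can be illustrated in plain Ruby. This is a conceptual sketch only: distinct_batches and the sample IDs are hypothetical, and the values are deduplicated in memory here, which is not how the database-side strategy named above works.

```ruby
# Sample issues.project_id values; several issues share the same project.
issue_project_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 7, 9, 9]

# Yield each distinct value exactly once, in ascending order, in fixed-size batches.
def distinct_batches(values, batch_size)
  values.uniq.sort.each_slice(batch_size).to_a
end

puts distinct_batches(issue_project_ids, 2).inspect # [[1, 2], [5, 7], [9]]
```

Twelve rows collapse into five distinct project IDs, so the job handles three batches instead of six.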

Calculate overall time estimation of a batched background migration

It's possible to estimate how long a BBM takes to complete. GitLab already provides an estimation through the db:gitlabcom-database-testing pipeline. This estimation is based on sampling production data in a test environment, and represents the maximum time the migration could take, not necessarily the actual time it will take.

In certain scenarios, the estimations provided by the db:gitlabcom-database-testing pipeline may not account for all the singularities of the records being migrated, making further calculations necessary. In those cases, the formula interval * number of records / max batch size gives an approximate estimation of how long the migration will take. Here, interval and max batch size refer to options defined for the job, and number of records is the count of records to be migrated.
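
The formula can be worked through with a short Ruby sketch. The method name and the record count are hypothetical, and the interval and batch size are borrowed from the examples on this page. Rounding the number of batches up is an assumption added here so that a partial final batch still counts as one interval.

```ruby
# Estimate from the formula: interval * number of records / max batch size.
def estimated_duration_seconds(interval_seconds:, record_count:, max_batch_size:)
  raise ArgumentError, 'max_batch_size must be positive' unless max_batch_size.positive?

  # Round the number of batches up: a partial final batch still occupies one interval.
  batches = (record_count.to_f / max_batch_size).ceil
  interval_seconds * batches
end

seconds = estimated_duration_seconds(
  interval_seconds: 120,    # 2 minutes between jobs
  record_count: 10_000_000, # hypothetical table size
  max_batch_size: 5_000
)
puts seconds # 240000
```

With these numbers, 2000 batches at 2 minutes each come to 240000 seconds, or roughly 67 hours.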

Cleaning up a batched background migration

NOTE: Cleaning up any remaining background migrations must be done in either a major or minor release. You must not do this in a patch release.

Because background migrations can take a long time, you can't immediately clean things up after queueing them. For example, you can't drop a column used in the migration process, as jobs would fail. You must add a separate post-deployment migration in a future release that finishes any remaining jobs before cleaning things up (for example, removing a column).

To migrate the data from column foo (containing a big JSON blob) to column bar (containing a string), you would:

  1. Release A:
    1. Create a migration class that performs the migration for a row with a given ID.
    2. Update new rows using one of these techniques:
      • Create a new trigger for simple copy operations that don't need application logic.
      • Handle this operation in the model/service as the records are created or updated.
      • Create a new custom background job that updates the records.
    3. Queue the batched background migration for all existing rows in a post-deployment migration.
  2. Release B:
    1. Add a post-deployment migration that checks if the batched background migration is completed.
    2. Deploy code so that the application starts using the new column and stops updating new records.
    3. Remove the old column.

A bump to the import/export version may be required if importing a project from a prior version of GitLab requires the data to be in the new format.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #388789 (closed)

Edited by Leonardo da Rosa
