Adjust sub-batch size for failed Batched Background Migration Jobs

What does this MR do and why?

Overview

Reduces the sub_batch_size of a BatchedMigrationJob when a timeout occurs during sub-batch processing.

It rescues the following exceptions:

ActiveRecord::StatementTimeout
ActiveRecord::ConnectionTimeoutError
ActiveRecord::AdapterTimeout
ActiveRecord::LockWaitTimeout
ActiveRecord::QueryCanceled
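
The rescue-and-translate pattern can be sketched in plain Ruby as follows. This is a minimal illustration, not GitLab's implementation: the `TimeoutSketch` module, its stand-in exception classes, and `with_timeout_translation` are hypothetical names introduced so the snippet runs without Rails. In the real code the rescued classes would be the `ActiveRecord` exceptions listed above.

```ruby
# Hypothetical stand-ins so the sketch runs without Rails loaded.
module TimeoutSketch
  StatementTimeout       = Class.new(StandardError)
  ConnectionTimeoutError = Class.new(StandardError)
  AdapterTimeout         = Class.new(StandardError)
  LockWaitTimeout        = Class.new(StandardError)
  QueryCanceled          = Class.new(StandardError)

  # Single error type that the caller has to handle.
  SubBatchTimeoutError = Class.new(StandardError)

  TIMEOUT_EXCEPTIONS = [
    StatementTimeout, ConnectionTimeoutError, AdapterTimeout,
    LockWaitTimeout, QueryCanceled
  ].freeze

  # Runs the block and re-raises any timeout as SubBatchTimeoutError.
  # Ruby sets the new error's #cause to the original exception automatically.
  def self.with_timeout_translation
    yield
  rescue *TIMEOUT_EXCEPTIONS => e
    raise SubBatchTimeoutError, "sub-batch timed out: #{e.class.name}"
  end
end
```

Funnelling every timeout into one error class means the caller needs a single `rescue` clause, while the original exception stays reachable through `#cause` for logging.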

Solves #377308 (closed)

Feature Flag Issue: #393556 (closed)

Details

If a timeout happens while processing each_sub_batch, a Gitlab::Database::BackgroundMigration::SubBatchTimeoutError is raised. The migration wrapper rescues this error and calls BatchedJob#reduce_sub_batch_size!, which reduces the sub-batch size by 25%:

  • BatchedJob#sub_batch_size will never exceed batch_size
  • BatchedJob#sub_batch_size will be reduced 2 times (about 44% in total) before the cycle is reset by BatchedJob#split_and_retry!
  • The reduction happens while the state of BatchedMigrationJob is being changed to :failed
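
The arithmetic behind these bullets can be checked with a short standalone script. It assumes a reduction factor of 0.75 with the result truncated to an integer, which matches the 150 → 112 → 84 values in the validation steps. The helper name `reduced_sub_batch_size`, the constant, and the floor of 1 are assumptions for illustration, not confirmed details of `BatchedJob`.

```ruby
# Assumed reduction factor: each timeout keeps 75% of the current size.
REDUCE_FACTOR = 0.75

# Hypothetical helper: returns the next sub-batch size after one timeout.
# The result is truncated; the floor of 1 is an assumption that keeps it positive.
def reduced_sub_batch_size(current)
  [(current * REDUCE_FACTOR).to_i, 1].max
end

initial = 150
first   = reduced_sub_batch_size(initial) # 112 (112.5 truncated)
second  = reduced_sub_batch_size(first)   # 84

# Cumulative reduction after two timeouts, as a percentage of the initial size.
total_reduction = ((initial - second) * 100.0 / initial).round

puts "#{initial} -> #{first} -> #{second} (#{total_reduction}% smaller)"
```

Two successive 25% reductions leave 0.75 × 0.75 = 56.25% of the original size, which is where the roughly 44% figure comes from.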

How to set up and validate locally

  1. Create a new background migration: rails g post_deployment_migration AdjustSubBatchSizeOnTimeout
Example
class AdjustSubBatchSizeOnTimeout < Gitlab::Database::Migration[2.1]
  MIGRATION = 'AdjustSubBatchSizeOnTimeout'
  TABLE_NAME = :issues
  BATCH_COLUMN = :id
  BATCH_SIZE = 500
  SUB_BATCH_SIZE = 150

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    queue_batched_background_migration(
      MIGRATION,
      TABLE_NAME,
      BATCH_COLUMN,
      batch_size: BATCH_SIZE,
      sub_batch_size: SUB_BATCH_SIZE,
      job_interval: 2.minutes
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, TABLE_NAME, BATCH_COLUMN, [])
  end
end
  2. Create a new class to process the migration:
Example
module Gitlab
  module BackgroundMigration
    class AdjustSubBatchSizeOnTimeout < BatchedMigrationJob
      operation_name :update_all
      feature_category :database

      def perform
        each_sub_batch do |_|
          Issue.transaction do
            Issue.connection.execute 'SET statement_timeout = 10'
            issue = Issue.lock.find(1)
            Logger.new($stdout).info('Lock on Issue(1) for 10min.')
            issue.connection.execute('SELECT * FROM pg_sleep(600);')
          end
        end
      end
    end
  end
end
  3. Run rails db:migrate. In the output, check for:
Caused by:
  PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
  4. Open the Rails console, find the first retriable job, and check its sub_batch_size. It should be reduced by 25%:
base_model = Gitlab::Database.database_base_models[:main]
migration = Gitlab::Database::BackgroundMigration::BatchedMigration.active_migration(connection: base_model.connection)
retriable_job = migration.batched_jobs.retriable.first

retriable_job.status
=> 2 #failed

retriable_job.sub_batch_size
=> 112 # 150 - 25% = 112.5, truncated to 112
  5. Retry the failed job:
migration_wrapper = Gitlab::Database::BackgroundMigration::BatchedMigrationWrapper.new(connection: base_model.connection)
migration_wrapper.perform(retriable_job)

retriable_job.status
=> 2 #failed

retriable_job.sub_batch_size
=> 84 # 112 - 25% = 84

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #377308 (closed)

Edited by Leonardo da Rosa