Optimise RelationObjectSaver to handle database timeouts during import

What does this MR do and why?

This MR introduces a retry mechanism in the RelationObjectSaver class to improve the reliability of nested record imports during Direct Transfer.

Previously, if a database statement timeout (e.g., ActiveRecord::QueryCanceled) occurred during the import of nested records (such as notes, diff notes, or events), Direct Transfer would stop processing subsequent records. This resulted in partial imports, where any records after the failure were skipped entirely and never retried.

Example: When importing a merge request with multiple nested records:

MergeRequest - Imported
 - Note1 - Imported
 - Note2 - Imported
 - Note3 - Imported
 - DiffNote1 <-- Error
 - DiffNote2
 - Event1
 - Event2

Only the records before the failure at DiffNote1 were imported. DiffNote1 and everything after it (DiffNote2, Event1, Event2) were lost in the migration.

Issue: Direct Transfer processes nested records in batches - for instance, 300 notes are divided into three batches of 100. If a timeout occurs during the second batch, that entire batch and all following batches (e.g., the third batch) are skipped, resulting in incomplete data migration.

Solution: This MR adds a retry mechanism that works as follows.

 - When a batch fails, it is automatically divided into smaller sub-batches, each 1/4 of the size of the batch that failed.
 - Up to three retries are attempted, and the batch size shrinks again on each retry.
 - Only if all retry attempts fail are the affected records skipped.
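The retry behaviour described above can be sketched in plain Ruby. This is a minimal illustration, not the code in this MR: the method name save_batch_with_retry is borrowed from the worked example in this description, but its body, the QueryCanceled stand-in class, the constants, and fake_save are assumptions made so the sketch runs without Rails. Rounding the sub-batch size up is inferred from the 100 → 25 → 7 sizes in that example.

```ruby
# Stand-in for ActiveRecord::QueryCanceled so the sketch runs without Rails.
class QueryCanceled < StandardError; end

MAX_RETRIES = 3   # "up to three retries" per the description
SHRINK_FACTOR = 4 # each retry uses 1/4 of the failed batch size

# Tries to save `records` with the given block. On a timeout, splits the
# batch into quarter-size sub-batches and retries each one.
# Returns the records that were skipped after all retries were exhausted.
def save_batch_with_retry(records, retry_count: 0, &save)
  save.call(records)
  []
rescue QueryCanceled
  return records if retry_count >= MAX_RETRIES

  # Rounding up is an assumption inferred from the 100 -> 25 -> 7 sizes.
  smaller_size = [(records.size / SHRINK_FACTOR.to_f).ceil, 1].max
  records.each_slice(smaller_size).flat_map do |sub_batch|
    save_batch_with_retry(sub_batch, retry_count: retry_count + 1, &save)
  end
end

# Hypothetical save that times out whenever note 160 sits in a batch
# larger than 7 records.
saved = []
fake_save = lambda do |batch|
  raise QueryCanceled if batch.include?(160) && batch.size > 7

  saved.concat(batch)
end

# 300 notes are processed in three batches of 100.
skipped = (1..300).each_slice(100).flat_map do |batch|
  save_batch_with_retry(batch, &fake_save)
end
```

With this fake_save, the second batch fails at size 100 and again at size 25, then succeeds at size 7, so skipped ends up empty and all 300 notes land in saved.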

Example: Processing a Merge Request with 300 Nested Notes

The following table illustrates how one merge_request (a relation_object) containing 300 notes (a collection_subrelation) is processed using the updated retry mechanism.

| Batch # | Records | Process | Outcome | Retry Count | New Batch Size |
|---|---|---|---|---|---|
| Initial | 300 notes | move_subrelations | → 3 batches | - | 100 |
| Batch 1 | notes[1-100] | save_batch_with_retry | Success | 0 | - |
| Batch 2 | notes[101-200] | save_batch_with_retry | Timeout | 0 | - |
| Batch 2 Retry | notes[101-200] | process_with_smaller_batch_size | → 4 batches | 1 | 25 |
| Batch 2.1 | notes[101-125] | save_batch_with_retry | Success | 1 | - |
| Batch 2.2 | notes[126-150] | save_batch_with_retry | Success | 1 | - |
| Batch 2.3 | notes[151-175] | save_batch_with_retry | Timeout | 1 | - |
| Batch 2.3 Retry | notes[151-175] | process_with_smaller_batch_size | → 4 batches | 2 | 7 |
| Batch 2.3.1 | notes[151-157] | save_batch_with_retry | Success | 2 | - |
| Batch 2.3.2 | notes[158-164] | save_batch_with_retry | Success | 2 | - |
| Batch 2.3.3 | notes[165-171] | save_batch_with_retry | Success | 2 | - |
| Batch 2.3.4 | notes[172-175] | save_batch_with_retry | Success | 2 | - |
| Batch 2.4 | notes[176-200] | save_batch_with_retry | Success | 1 | - |
| Batch 3 | notes[201-300] | save_batch_with_retry | Success | 0 | - |
| Result | 300 records | All processed | Complete | - | - |

Issue - Direct Transfer - Statement timeout causes missing records

Elastic logs - https://log.gprd.gitlab.net/app/r/s/VKXrq

NOTE: These changes are controlled by the feature flag :import_rescue_query_canceled, which is disabled by default. This issue serves as the rollout task to enable the feature flag.

How to set up and validate locally

To validate the changes, test both the behaviour when an ActiveRecord::QueryCanceled exception occurs and the import process under normal conditions. In my local environment, I tried to trigger ActiveRecord::QueryCanceled by lowering the PostgreSQL statement_timeout to 10ms, but the exception did not occur. To simulate it instead, I created a script that is run from the Rails console and manually raises ActiveRecord::QueryCanceled, so the changes can be verified directly from the console.

Verify the changes using the following steps:

  1. Test with Simulated Exception: Use the provided script in the Rails console to trigger the ActiveRecord::QueryCanceled exception and validate that the retry mechanism works as expected.
  2. Verify Normal Import Flow: Run a standard import without triggering any exceptions to confirm that the import process functions correctly and without issues.
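As a rough illustration of step 1 only, and not the script mentioned above, one way to simulate a timeout is to wrap the save call in an object that raises on chosen calls. Everything in this sketch is hypothetical: TimeoutInjector, SimulatedQueryCanceled, and the sample batches are made-up names and data, and the stand-in error class is used so the code runs outside Rails. In a Rails console, ActiveRecord::QueryCanceled could be raised in its place.

```ruby
# Stand-in for ActiveRecord::QueryCanceled so the sketch runs without Rails.
class SimulatedQueryCanceled < StandardError; end

# Hypothetical wrapper: raises a timeout on the selected call numbers and
# otherwise delegates to the wrapped callable.
class TimeoutInjector
  attr_reader :call_count

  def initialize(target, fail_on:)
    @target = target
    @fail_on = fail_on
    @call_count = 0
  end

  def call(batch)
    @call_count += 1
    if @fail_on.include?(@call_count)
      raise SimulatedQueryCanceled, "simulated statement timeout on call #{@call_count}"
    end

    @target.call(batch)
  end
end

# Fail only the second save call, then record what happened to each batch.
imported = []
saver = TimeoutInjector.new(->(batch) { imported.concat(batch) }, fail_on: [2])

outcomes = [[1, 2], [3, 4], [5, 6]].map do |batch|
  saver.call(batch)
  :success
rescue SimulatedQueryCanceled
  :timeout
end
```

Injecting the failure at a fixed call number keeps the test deterministic, which a lowered statement_timeout did not provide locally.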

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #509325 (closed)

Edited by Jaydip Pansuriya
