Optimise RelationObjectSaver to handle database timeouts during import
What does this MR do and why?
This MR introduces a retry mechanism in the `RelationObjectSaver` class to improve the reliability of nested record imports during Direct Transfer.
Previously, if a database statement timeout (e.g., `ActiveRecord::QueryCanceled`) occurred during the import of nested records (such as notes, diff notes, or events), Direct Transfer stopped processing subsequent records. This resulted in partial imports: every record after the failure was skipped and never retried.
Example: When importing a merge request with multiple nested records:
MergeRequest - Imported
- Note1 - Imported
- Note2 - Imported
- Note3 - Imported
- DiffNote1 <-- Error
- DiffNote2
- Event1
- Event2
Only the records up to the failure (DiffNote1) were imported. The rest were lost in the migration.
Issue: Direct Transfer processes nested records in batches - for instance, 300 notes are divided into three batches of 100. If a timeout occurs during the second batch, that entire batch and all following batches (e.g., the third batch) are skipped, resulting in incomplete data migration.
Solution: This MR adds a retry mechanism:
- When a batch fails, it is automatically divided into smaller sub-batches (1/4 of the original size).
- The system attempts up to three retries.
- Only if all retry attempts fail are the affected records skipped.
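
The retry behaviour described above can be sketched in plain Ruby. This is a minimal illustration, not the code in this MR: `QueryCanceled` is a local stand-in for `ActiveRecord::QueryCanceled` so the snippet runs without Rails, and `MAX_RETRIES`, `SPLIT_FACTOR`, and the `save` callable are hypothetical names. Rounding the sub-batch size up is an assumption.

```ruby
# Hypothetical stand-in for ActiveRecord::QueryCanceled (keeps the sketch Rails-free).
class QueryCanceled < StandardError; end

MAX_RETRIES = 3   # assumption: "up to three retries"
SPLIT_FACTOR = 4  # assumption: sub-batches are 1/4 of the failed batch

# Saves `records` with the `save` callable. On a timeout, splits the batch into
# quarter-sized sub-batches and retries each of them, at most MAX_RETRIES levels deep.
# Returns the records that had to be skipped after all retries failed.
def save_batch_with_retry(records, save, retry_count = 0)
  save.call(records)
  []
rescue QueryCanceled
  # Retries exhausted: report the whole batch as skipped.
  return records if retry_count >= MAX_RETRIES

  # Round up so the sub-batch size never drops to zero.
  smaller_size = [(records.size / SPLIT_FACTOR.to_f).ceil, 1].max
  records.each_slice(smaller_size).flat_map do |sub_batch|
    save_batch_with_retry(sub_batch, save, retry_count + 1)
  end
end

# Usage: simulate a timeout whenever note 160 is saved in a batch larger than 7.
saved = []
save = lambda do |batch|
  raise QueryCanceled if batch.include?(160) && batch.size > 7

  saved.concat(batch)
end

skipped = (1..300).each_slice(100).flat_map { |batch| save_batch_with_retry(batch, save) }
puts "saved=#{saved.size} skipped=#{skipped.size}"
```

Because each failed batch is retried on its own, the batches after it are still processed, which is the key difference from the previous behaviour.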
Example: Processing a Merge Request with 300 Nested Notes
The following table illustrates how one `merge_request` (a `relation_object`) containing 300 `notes` (a `collection_subrelation`) is processed using the updated retry mechanism.
Batch # | Records | Process | Outcome | Retry Count | New Batch Size |
---|---|---|---|---|---|
Initial | 300 notes | move_subrelations | → 3 batches | - | 100 |
Batch 1 | notes[1-100] | save_batch_with_retry | Success | 0 | - |
Batch 2 | notes[101-200] | save_batch_with_retry | Failed | 0 | - |
Batch 2 Retry | notes[101-200] | process_with_smaller_batch_size | → 4 batches | 1 | 25 |
Batch 2.1 | notes[101-125] | save_batch_with_retry | Success | 1 | - |
Batch 2.2 | notes[126-150] | save_batch_with_retry | Success | 1 | - |
Batch 2.3 | notes[151-175] | save_batch_with_retry | Failed | 1 | - |
Batch 2.3 Retry | notes[151-175] | process_with_smaller_batch_size | → 4 batches | 2 | 7 |
Batch 2.3.1 | notes[151-157] | save_batch_with_retry | Success | 2 | - |
Batch 2.3.2 | notes[158-164] | save_batch_with_retry | Success | 2 | - |
Batch 2.3.3 | notes[165-171] | save_batch_with_retry | Success | 2 | - |
Batch 2.3.4 | notes[172-175] | save_batch_with_retry | Success | 2 | - |
Batch 2.4 | notes[176-200] | save_batch_with_retry | Success | 1 | - |
Batch 3 | notes[201-300] | save_batch_with_retry | Success | 0 | - |
Result | 300 records | All processed | Success | - | - |
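
The batch boundaries in the table can be reproduced with a few lines of Ruby. This is only a worked check of the arithmetic. The helper name `split_into_quarters` is hypothetical, and rounding the quarter size up is an assumption inferred from the table, where a batch of 25 is split into sub-batches of 7.

```ruby
# Hypothetical helper: splits a batch into sub-batches of a quarter of its size,
# rounding the sub-batch size up (assumption inferred from the table: 25 -> 7).
def split_into_quarters(records)
  size = [(records.size / 4.0).ceil, 1].max
  records.each_slice(size).to_a
end

# Returns [first, last] for each sub-batch, to compare against the table rows.
def ranges(batches)
  batches.map { |batch| [batch.first, batch.last] }
end

batch_2 = (101..200).to_a
first_split = split_into_quarters(batch_2)
p ranges(first_split)
# => [[101, 125], [126, 150], [151, 175], [176, 200]]

batch_2_3 = first_split[2]
second_split = split_into_quarters(batch_2_3)
p ranges(second_split)
# => [[151, 157], [158, 164], [165, 171], [172, 175]]
```

The last sub-batch of the second split holds only four notes because 25 is not divisible by 7.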
Issue - Direct Transfer - Statement timeout causes missing records
Elastic logs - https://log.gprd.gitlab.net/app/r/s/VKXrq
NOTE: These changes are controlled by the feature flag `:import_rescue_query_canceled`, which is set to false by default. This issue serves as the rollout task to enable the feature flag.
How to set up and validate locally
To validate the changes, test the behaviour when an `ActiveRecord::QueryCanceled` exception occurs, and confirm that the import process still works as expected under normal conditions. In my local environment, I attempted to trigger `ActiveRecord::QueryCanceled` by lowering the PostgreSQL `statement_timeout` to `10ms`, but the exception did not occur. To simulate it instead, I created a script that can be run from the Rails console to manually raise the `ActiveRecord::QueryCanceled` error and verify the changes directly from the console.
Verify the changes using the following steps:
- Test with Simulated Exception: Use the provided script in the Rails console to trigger the `ActiveRecord::QueryCanceled` exception and validate that the retry mechanism works as expected.
- Verify Normal Import Flow: Run a standard import without triggering any exceptions to confirm that the import process functions correctly and without issues.
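
The script used for the first step is not reproduced here. As a rough illustration of the general idea only, the following plain-Ruby sketch injects a failure into a save method with `Module#prepend`. Every name in it (`SimulatedQueryCanceled`, `FakeNote`, `TimeoutInjector`) is hypothetical. In a real Rails console the target would be an actual model and the error would be `ActiveRecord::QueryCanceled`.

```ruby
# Hypothetical stand-in for ActiveRecord::QueryCanceled.
class SimulatedQueryCanceled < StandardError; end

# Hypothetical record class standing in for an imported model.
class FakeNote
  attr_reader :id, :saved

  def initialize(id)
    @id = id
    @saved = false
  end

  def save!
    @saved = true
  end
end

# Prepended module: makes save! raise for the next `failures_left` calls,
# then falls through to the original implementation.
module TimeoutInjector
  class << self
    attr_accessor :failures_left
  end
  self.failures_left = 0

  def save!
    if TimeoutInjector.failures_left > 0
      TimeoutInjector.failures_left -= 1
      raise SimulatedQueryCanceled, "simulated statement timeout"
    end
    super
  end
end

FakeNote.prepend(TimeoutInjector)

# Usage: fail twice, then succeed on the third attempt.
TimeoutInjector.failures_left = 2
note = FakeNote.new(1)
attempts = 0
begin
  attempts += 1
  note.save!
rescue SimulatedQueryCanceled
  retry if attempts < 3
end
puts "attempts=#{attempts} saved=#{note.saved}"
```

Injecting a fixed number of failures makes the test deterministic, unlike a lowered `statement_timeout`, which depends on how long the query actually takes.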
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #509325 (closed)