Remove hardcoded time limit for migrations to complete - Direct Transfer
Release notes
Direct Transfer migrations may sometimes encounter issues that cause them to get stuck, for example when a child worker is interrupted during a Sidekiq restart or fails repeatedly because of an unhandled exception. To avoid leaving these migrations in an incomplete state indefinitely, a worker called BulkImports::StuckImportWorker is executed periodically. This worker identifies migrations that have not completed within 8 hours and marks them as timed out.
The 8-hour period may not be sufficient to determine whether a migration is stuck, especially for large organizations where the migration can take longer. As a result, StuckImportWorker may incorrectly mark a migration that is still running as stuck.
In this milestone, instead of using a fixed period, we mark a migration as stuck only if its child workers have unexpectedly stopped working.
Context
Direct Transfer migrations may sometimes encounter issues that cause them to get stuck. This can happen for various reasons, which I explain below. To avoid leaving these migrations in an incomplete state indefinitely, a worker called BulkImports::StuckImportWorker is executed periodically. This worker identifies migrations that have not completed within 8 hours and marks them as timed out.
Reasons for migration to get stuck
- A child worker might be interrupted or terminated during a Sidekiq restart, which prevents it from completing its task and stops the parent/monitor worker from progressing the migration.
- A child worker fails multiple times due to an unhandled exception, which also stops the parent/monitor worker from progressing the migration.
- The parent/monitor worker stops being re-enqueued due to an unhandled exception getting incorrectly deduplicated.
Problem
The 8-hour period may not be sufficient to determine whether a migration is stuck, especially for large organizations where the migration can take longer. As a result, StuckImportWorker may incorrectly mark a migration that is still running as stuck.
Proposed solution
Instead of using a fixed period, we should mark a migration as stuck only if its child workers have unexpectedly stopped working. Tracking job IDs (JIDs) is difficult in Direct Transfer because we re-enqueue workers manually and every re-enqueue produces a new JID. Instead, we can make the child workers occasionally refresh the expiration time of a known key in Redis. For example, each refresh can reset the expiration time to 8 hours from that moment, so a key expires only when no worker has reported progress for 8 hours.
To implement this, we will work with the following Redis keys:
- "bulk_imports/worker_status/bulk_import/%{id}"
- "bulk_imports/worker_status/bulk_import_entity/%{id}"
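As a minimal sketch of how these key templates could be filled in, the snippet below uses Ruby's `format` with a named `%{id}` reference. The constant `WORKER_STATUS_KEYS` and the helper `worker_status_key` are hypothetical names chosen for illustration; only the two key templates come from this proposal.

```ruby
# Key templates as listed in the proposal.
# WORKER_STATUS_KEYS and worker_status_key are hypothetical names.
WORKER_STATUS_KEYS = {
  bulk_import: "bulk_imports/worker_status/bulk_import/%{id}",
  bulk_import_entity: "bulk_imports/worker_status/bulk_import_entity/%{id}"
}.freeze

# Builds the Redis key for a given record type and database ID.
# Raises KeyError for an unknown type instead of silently building a bad key.
def worker_status_key(type, id)
  format(WORKER_STATUS_KEYS.fetch(type), id: id)
end

puts worker_status_key(:bulk_import, 42)
# prints "bulk_imports/worker_status/bulk_import/42"
```

With a Redis client, refreshing such a key would then amount to writing it again with a new expiration, for instance a `SET` with the `EX` option set to 8 hours in seconds.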
Here's a breakdown of the necessary updates:
- BulkImports::EntityWorker: Update the worker to refresh the "bulk_import" key every time it's re-enqueued.
- BulkImports::Pipeline::Runner: Modify the runner to refresh the "bulk_import_entity" key after every 1000 migrated records and again when the migration process completes.
- BulkImports::PipelineWorker: Update the worker to refresh the "bulk_import_entity" key each time it's re-enqueued.
- StuckImportWorker: Update the worker to mark the bulk_import or bulk_import_entity as timed out only if its key has expired.
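The refresh-and-check cycle described in this breakdown could look roughly like the sketch below. It is not the actual worker code: `FakeExpiringStore` is a hypothetical in-memory stand-in for Redis so the example runs without a server, and `migrate_records`, `stuck?`, `TTL_SECONDS`, and `REFRESH_EVERY` are illustrative names. The sketch also assumes that a key which was never written counts as expired.

```ruby
# Hypothetical in-memory stand-in for Redis key expiration.
# The clock is injected so expiry can be simulated without sleeping.
class FakeExpiringStore
  def initialize(clock)
    @clock = clock
    @expires_at = {}
  end

  # Roughly what writing the key with a fresh expiration would do.
  def refresh(key, ttl)
    @expires_at[key] = @clock.call + ttl
  end

  # True while the key exists and has not yet expired.
  def alive?(key)
    expiry = @expires_at[key]
    !expiry.nil? && expiry > @clock.call
  end
end

TTL_SECONDS = 8 * 60 * 60 # 8 hours, as in the proposal
REFRESH_EVERY = 1000      # records between refreshes, as in the proposal

def entity_key(id)
  format("bulk_imports/worker_status/bulk_import_entity/%{id}", id: id)
end

# Sketch of the runner loop: refresh the key every 1000 records
# and once more when the migration completes.
def migrate_records(store, entity_id, records)
  records.each_with_index do |record, index|
    yield record if block_given?
    store.refresh(entity_key(entity_id), TTL_SECONDS) if ((index + 1) % REFRESH_EVERY).zero?
  end
  store.refresh(entity_key(entity_id), TTL_SECONDS)
end

# Sketch of the stuck check: an entity is considered stuck only when its key has expired.
def stuck?(store, entity_id)
  !store.alive?(entity_key(entity_id))
end

now = 0
store = FakeExpiringStore.new(-> { now })
migrate_records(store, 7, 1..2500)
puts stuck?(store, 7) # prints "false": the key was just refreshed
now += TTL_SECONDS + 1
puts stuck?(store, 7) # prints "true": no refresh for more than 8 hours
```

The point of the design is that the timeout measures time since the last sign of progress instead of time since the migration started, so a long-running migration that keeps processing records is never marked as timed out.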