Reduce Direct Transfer job execution duration
Direct Transfer BulkImports::PipelineWorker and BulkImports::PipelineBatchWorker jobs may run for an extended period, which isn't ideal for several reasons:
- Long-running jobs accumulate memory over time, leading to memory bloat.
- If jobs take too long, they can block threads, reducing the ability to process other jobs efficiently, especially on self-managed instances where Sidekiq jobs are not prioritized by default.
- Problematic jobs continue running even after the migration is canceled.
- Long-running jobs are more likely to be interrupted by Sidekiq restarts caused by the Sidekiq Memory Killer or a code deployment. This is less of a concern since 17.5, when the maximum interruption limit was increased from 3 to 20.
Based on Kibana logs, one execution of BulkImports::PipelineBatchWorker on GitLab.com took 9,000 seconds (2.5 hours), and the 99th percentile of execution duration was 2,120 seconds (roughly 35 minutes).
When migrating the Node.js project to a 2k reference architecture instance, the jobs took much longer, sometimes more than 4 hours, since that instance has fewer available resources than GitLab.com.
Proposed solution
To reduce job execution time, one approach is to re-enqueue the job at regular intervals, such as every 5 minutes. This method is suitable for NDJSON pipeline jobs because they can resume from where they left off. However, for this approach to be effective, we must update Direct Transfer to store the downloaded relation export files on the destination instance. Otherwise, the files would need to be downloaded again each time the pipeline is restarted. This update is also necessary for air-gapped migration. Additionally, it would allow us to read files directly from object storage without needing to decompress them first.