BackfillMergeRequestFileDiffsPartitionedTable: merge_request_diff_files batched background migration triggers Sidekiq memory spike
Summary
After upgrading from 18.5.z to 18.8.z, Sidekiq memory usage spikes. In Helm environments, the spike repeatedly triggers Sidekiq pod restarts. At best this slows down completion of the batched background migration; at worst the migration stalls completely.
We have at least 4 customers reporting the same issue, all of them on Helm chart deployments.
Looking at the list of batched background migrations, `BackfillMergeRequestFileDiffsPartitionedTable` on `merge_request_diff_files` appears to be the one that stalled.
After further review, all of them have externalDiffs disabled (the default configuration), so they keep their merge_request_diff* data in the database.
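To confirm which batched migration is affected and how it is currently configured, the migration row can be inspected from a GitLab Rails console. This is a sketch based on the documented `BatchedMigration` model; attribute and method names can vary between GitLab versions, so verify against the docs for the running version before acting on it:

```ruby
# sudo gitlab-rails console
migration = Gitlab::Database::BackgroundMigration::BatchedMigration
  .find_by(job_class_name: 'BackfillMergeRequestFileDiffsPartitionedTable')

migration.sub_batch_size  # rows processed per sub-batch; suspected memory driver
migration.status_name     # current state of the migration (name may differ by version)
```

If the sub-batch size turns out to be the culprit, lowering it via `migration.update!(sub_batch_size: ...)` is one possible mitigation, but that should be confirmed with the migration owners first.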
My hypothesis: the memory spike is triggered because these instances have a lot of data in the `diff` column, and the backfill process somehow holds much of it in Sidekiq memory. We also increased the `sub_batch_size` in 18.8 to speed up the migration, but I'm not sure whether that is related to the higher memory usage.
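A back-of-envelope estimate illustrates why a larger sub-batch size could matter here. All numbers below are assumptions for illustration, not measurements from an affected instance:

```ruby
# Rough estimate of the memory a Sidekiq worker needs just to hold one
# sub-batch of merge_request_diff_files rows while the backfill copies them.
AVG_DIFF_BYTES = 2 * 1024 * 1024  # assumed average `diff` column size: 2 MiB
SUB_BATCH_SIZE = 100              # assumed post-18.8 sub-batch size
COPY_OVERHEAD  = 3                # assumed factor for AR objects + string copies

def sub_batch_mib(avg_bytes, rows, overhead)
  (avg_bytes * rows * overhead) / (1024.0 * 1024.0)
end

puts format('~%.0f MiB held per sub-batch',
            sub_batch_mib(AVG_DIFF_BYTES, SUB_BATCH_SIZE, COPY_OVERHEAD))
```

Under these assumptions a single sub-batch transiently holds hundreds of MiB, which would be enough to trip `SIDEKIQ_MEMORY_KILLER_MAX_RSS` on a default-sized pod.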
Since some users keep their MR diffs in object storage, have smaller diff data, or run on Omnibus where a temporary memory spike causes no issue, perhaps that is why we are not seeing more reports of this problem.
Can we look into what is causing the memory spike in this migration?
Steps to reproduce
Example Project
Internal only support ticket reference:
What is the current bug behavior?
Batched background migrations trigger a memory spike in Sidekiq.
What is the expected correct behavior?
Batched background migrations should not trigger a memory spike in Sidekiq.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)
(we will only investigate if the tests are passing)
Possible fixes
A workaround that has worked is to increase `SIDEKIQ_MEMORY_KILLER_MAX_RSS` and the Sidekiq pod memory resource limit to much higher values, so that the Sidekiq process and pod are not restarted. This, however, would not work for clusters with limited resources.
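For Helm deployments, this workaround can be expressed in the chart values. The key names below follow the GitLab Helm chart's Sidekiq settings as I understand them, and the numbers are illustrative, not recommendations; verify both against the chart documentation for the deployed version:

```yaml
# values.yaml sketch (assumed key names; check against your chart version)
gitlab:
  sidekiq:
    memoryKiller:
      maxRss: 3000000        # SIDEKIQ_MEMORY_KILLER_MAX_RSS, in kilobytes
    resources:
      limits:
        memory: 4Gi          # raise the pod limit alongside maxRss
```

Note that raising `maxRss` without also raising the pod memory limit just moves the restart from the memory killer to the kubelet OOM kill, so the two need to be adjusted together.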
Patch release information for backports
If the bug fix needs to be backported in a patch release to a version under the maintenance policy, please follow the steps on the patch release runbook for GitLab engineers.
Refer to the internal "Release Information" dashboard for information about the next patch release, including the targeted versions, expected release date, and current status.
High-severity bug remediation
To remediate high-severity issues requiring an internal release for single-tenant SaaS instances, refer to the internal release process for engineers.