BackfillMergeRequestFileDiffsPartitionedTable: merge_request_diff_files batch background migrations triggers sidekiq memory spike
<!---
Please read this!
Before opening a new issue, make sure to search for keywords in the issues
filtered by the "regression" or "type::bug" label:
- https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=regression
- https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=type::bug
and verify the issue you're about to submit isn't a duplicate.
--->
### Summary
After upgrading from 18.5.z to 18.8.z, we encountered an issue where Sidekiq memory usage spiked. In Helm environments, the memory spike eventually triggers a Sidekiq pod restart, and this happens constantly. At best, this slows down completion of the whole batched background migration; at worst, the migration stalls completely.
We have at least 4 customers reporting the same issue. They are all on Helm Chart deployments.
Looking at the list of batched background migrations, it appears that `BackfillMergeRequestFileDiffsPartitionedTable: merge_request_diff_files` is the one that stalled.
After further review, all of them have externalDiffs disabled (which is the [default config](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/values.yaml?ref_type=heads#L482-483)), so they keep their `merge_request_diff*` data in the database.
My hypothesis is that the memory spike is triggered because they have a lot of data in the `diff` column, and this backfill process somehow uses a lot of memory in Sidekiq. We also [increased the sub_batch_size in 18.8](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/215846) to speed up the migration, but I'm not sure whether that is related to the higher memory usage.
Since some users keep their MR diffs in object storage, have smaller diff data, or run Omnibus where a temporary memory spike does not cause problems, perhaps that is why we are not seeing more reports of this issue.
Can we look into what is causing the memory spike in this migration?
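To confirm which migration is stalled and what batch sizes it is currently running with, something like the following can be run in a GitLab Rails console. This is a sketch: the `BatchedMigration` model and attribute names are taken from current GitLab internals and may differ across versions.

```ruby
# GitLab Rails console: `sudo gitlab-rails console` on Omnibus, or
# `gitlab-rails console` inside the toolbox pod on Helm deployments.
migration = Gitlab::Database::BackgroundMigration::BatchedMigration
  .find_by(job_class_name: 'BackfillMergeRequestFileDiffsPartitionedTable')

migration.status_name     # e.g. :active, :paused, :failed
migration.batch_size      # rows per batch
migration.sub_batch_size  # rows per sub-batch (the value raised in 18.8)
```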
### Steps to reproduce
<!-- Describe how one can reproduce the issue - this is very important. Please use an ordered list. -->
### Example Project
Internal only support ticket reference:
* [693417](https://gitlab.zendesk.com/agent/tickets/693417)
* [691674](https://gitlab.zendesk.com/agent/tickets/691674)
* [689027](https://gitlab.zendesk.com/agent/tickets/689027)
* [684638](https://gitlab.zendesk.com/agent/tickets/684638)
### What is the current *bug* behavior?
The batched background migration triggers a memory spike in Sidekiq.
### What is the expected *correct* behavior?
Batched background migrations should not trigger memory spikes in Sidekiq.
### Relevant logs and/or screenshots
<!-- Paste any relevant logs - please use code blocks (```) to format console output, logs, and code
as it's tough to read otherwise. -->
### Output of checks
<!-- If you are reporting a bug on GitLab.com, uncomment below -->
<!-- This bug happens on GitLab.com -->
<!-- and uncomment below if you have /label privileges -->
<!-- /label ~"reproduced on GitLab.com" -->
<!-- or follow up with an issue comment of `@gitlab-bot label ~"reproduced on GitLab.com"` if you do not -->
#### Results of GitLab environment info
<!-- Input any relevant GitLab environment information if needed. -->
<details>
<summary>Expand for output related to GitLab environment info</summary>
<pre>
(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)
(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
</pre>
</details>
#### Results of GitLab application Check
<!-- Input any relevant GitLab application check information if needed. -->
<details>
<summary>Expand for output related to the GitLab application check</summary>
<pre>
(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)
(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)
(we will only investigate if the tests are passing)
</pre>
</details>
### Possible fixes
A workaround that has worked is to increase [SIDEKIQ_MEMORY_KILLER_MAX_RSS](https://docs.gitlab.com/administration/sidekiq/sidekiq_memory_killer/#configuring-the-limits), as well as the Sidekiq pod memory resource limit, to a much higher value so that Sidekiq / the Sidekiq pod does not get restarted. This, however, would not work for clusters with limited resources.
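For Helm deployments, the workaround can be expressed as a `values.yaml` override along these lines. This is a sketch only: the `gitlab.sidekiq.memoryKiller` and `resources` key names should be verified against the chart version in use, and the numbers below are illustrative, not recommendations.

```yaml
gitlab:
  sidekiq:
    memoryKiller:
      maxRss: 4000000        # kilobytes (~4 GB); raise SIDEKIQ_MEMORY_KILLER_MAX_RSS
    resources:
      limits:
        memory: 6Gi          # keep the pod limit comfortably above maxRss
      requests:
        memory: 4Gi
```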
<!-- If you can, link to the line of code that might be responsible for the problem. -->
### Patch release information for backports
If the bug fix needs to be backported in a [patch release](https://handbook.gitlab.com/handbook/engineering/releases/patch-releases) to a version
under [the maintenance policy](https://docs.gitlab.com/policy/maintenance/), please follow the steps on the
[patch release runbook for GitLab engineers](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/patch/engineers.md).
Refer to the [internal "Release Information" dashboard](https://dashboards.gitlab.net/d/delivery-release_info/delivery3a-release-information?orgId=1)
for information about the next patch release, including the targeted versions, expected release date, and current status.
#### High-severity bug remediation
To remediate high-severity issues requiring an [internal release](https://handbook.gitlab.com/handbook/engineering/releases/internal-releases/) for single-tenant SaaS instances,
refer to the [internal release process for engineers](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/internal-releases/engineers.md?ref_type=heads).
<!-- If you don't have /label privileges, follow up with an issue comment of `@gitlab-bot label ~"type::bug"` -->