Investigate Memory Leak - Sidekiq memory killer being invoked more often recently
Original issue from Infra - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10922
Summary of notes
- We suspect a big burst from a large job, rather than a gradual memory leak.
- UpdatePagesService uses RubyZip, so that may be a source of memory bloat.
  - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10922#note_384227804
  - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10922#note_384240597
- UpdatePagesService has a plan to remove extract_zip_archive! soon (in 1~2 releases?). At that point this won't be a concern anymore.
- RubyZip uses Zlib. The root cause is in Zlib::Inflate.
  - Zlib::inflate is also used in the import file size check: https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/562#note_387863920
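For context, a streaming size check built on Zlib::Inflate could look roughly like the sketch below. This is not the code from the merge request above. The helper name inflated_size_within?, the chunk_size parameter and its default are hypothetical and used only for illustration. It feeds the compressed input in small chunks and only counts the inflated bytes, so the whole decompressed payload is never held in memory at once.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not GitLab's implementation): returns true when the
# zlib-compressed stream read from `io` inflates to at most `max_bytes`.
def inflated_size_within?(io, max_bytes, chunk_size: 1024)
  inflater = Zlib::Inflate.new
  total = 0

  # Read the compressed input in small chunks so each inflate call
  # produces a bounded amount of output.
  while (chunk = io.read(chunk_size))
    total += inflater.inflate(chunk).bytesize
    # Stop early once the limit is exceeded instead of inflating the rest.
    return false if total > max_bytes
  end

  true
ensure
  # Release the zlib stream even when we return early or an error is raised.
  inflater&.close
end

# Usage: 100,000 bytes of "a" compress to a tiny payload.
payload = Zlib::Deflate.deflate('a' * 100_000)
fits = inflated_size_within?(StringIO.new(payload), 200_000)
too_big = !inflated_size_within?(StringIO.new(payload), 50_000)
```

Each inflate call still allocates a new String for its output, which is the kind of per-chunk allocation that keeps RSS high until GC runs.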
The current progress:
- A feature request was raised with Zlib to reduce the max RSS of streaming inflate, and Zlib has completed the implementation (https://github.com/ruby/zlib/issues/19#issuecomment-721544181). More details in #231534 (comment 443219098).
Possible next TODOs: as mentioned in #231534 (comment 443219098), once the Zlib enhancement is released, we can:
- gem install zlib to make it available to our Ruby
- remove manual GC from our import file size check https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/562
- So far we have not found anywhere else that uses RubyZip extract (as noted above, UpdatePagesService will stop using RubyZip extract soon). If we do use RubyZip extract somewhere, we can make changes so that RubyZip benefits from this feature as well, and it would be nice to contribute the change to RubyZip upstream.
Availability and Testing:
- Work with dev to identify what actions specifically are causing the memory leaks this time.
- Set up an environment with monitoring
- Test Sidekiq on the environment by performing those actions again in the same manner they occur on .com
- While the actions are running, monitor memory usage to confirm whether the issue is fixed. Tests could be repeated on Staging and Canary as long as monitoring data can be accessed.
- Optionally do Quality performance tests (on test reference architecture environments) if the tests cover the same known problem areas (we don't have test data for everything yet, pages in particular is something we don't cover).
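When reproducing an action locally, a quick way to see its effect on memory is to sample the process RSS before and after it runs. The sketch below is a rough aid for that, not a replacement for the environment monitoring described above. The helper names current_rss_kb and rss_growth_kb are hypothetical, and it assumes either a Linux /proc filesystem or a ps command is available.

```ruby
# Hypothetical helper: resident set size of the current process in kB,
# or nil when neither /proc nor ps is available.
def current_rss_kb
  status = '/proc/self/status'
  if File.readable?(status)
    # On Linux the line looks like "VmRSS:      123456 kB".
    line = File.foreach(status).find { |l| l.start_with?('VmRSS:') }
    return line[/\d+/].to_i if line
  end

  # Fallback for systems without /proc (assumes ps supports -o rss=).
  out = `ps -o rss= -p #{Process.pid} 2>/dev/null`
  out.strip.empty? ? nil : out.to_i
rescue SystemCallError
  nil
end

# Hypothetical helper: runs the block and returns the RSS change in kB
# (can be negative), or nil when RSS could not be sampled.
def rss_growth_kb
  before = current_rss_kb
  yield
  after = current_rss_kb
  return nil if before.nil? || after.nil?

  after - before
end

# Usage: wrap the action under investigation in the block.
growth = rss_growth_kb { Array.new(100_000) { |i| i.to_s } }
puts "RSS change: #{growth.inspect} kB"
```

A single before/after sample misses short peaks inside the action, so for burst-style bloat a periodic sampler or the environment's own monitoring gives a more reliable picture.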
cc @gl-memory
Edited by Qingyu Zhao