Add Sidekiq daemon memory killer
What does this MR do?
This is the EE MR created from gitlab-foss!32469 (closed)
Notes from the original issue:
Step 1) Make the Sidekiq memory monitor (including the killer) an independent process (or thread). This has several benefits:
- 1.1) It monitors Sidekiq memory usage more accurately, at a chosen frequency. Today's Sidekiq memory killer checks Sidekiq RSS only after a job finishes, so RSS is never observed while a job is running. This is not ideal, especially for long-running jobs.
- 1.2) It can cost less. Today we always shell out to ps to check Sidekiq RSS after every job, even very simple or fast ones, so processing N small jobs introduces N ps shell-outs. A single process/thread that checks Sidekiq RSS every 3 seconds (configurable) costs less.
- 1.3) It makes it possible (or much easier) to observe memory usage over a period of time, which in turn lets us tolerate a temporary memory balloon for a short, configurable length of time.
- 1.4) We still need a HARD_LIMIT_RSS_MAX (configurable), but it will most likely be much larger than today's Sidekiq MAX_RSS. Once this hard limit is reached, the Sidekiq process will be killed immediately(?). The purpose is to avoid triggering the Linux OOM killer, which is much worse than killing Sidekiq.
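The checks described in Step 1 could be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the class name MemoryWatchdog, its parameters, and the status symbols are all hypothetical, and it assumes a ps that accepts `-o rss= -p PID` and reports RSS in kilobytes.

```ruby
# Hypothetical sketch of a periodic Sidekiq memory check (not the MR's actual code).
# Default RSS reader: shells out to ps once per check instead of once per job.
PS_RSS_READER = -> { `ps -o rss= -p #{Process.pid}`.to_i }
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

class MemoryWatchdog
  def initialize(soft_limit_kb:, hard_limit_kb:, grace_seconds:,
                 rss_reader: PS_RSS_READER, clock: MONOTONIC_CLOCK)
    @soft_limit_kb = soft_limit_kb
    @hard_limit_kb = hard_limit_kb
    @grace_seconds = grace_seconds
    @rss_reader = rss_reader
    @clock = clock
    @over_soft_since = nil # time at which RSS first went above the soft limit
  end

  # Returns :ok, :balloon (above the soft limit but still within the grace
  # period), :soft_limit_exceeded, or :hard_limit_exceeded.
  def check
    rss = @rss_reader.call
    now = @clock.call

    # The hard limit is never tolerated, regardless of the grace period.
    return :hard_limit_exceeded if rss > @hard_limit_kb

    if rss > @soft_limit_kb
      @over_soft_since ||= now
      return (now - @over_soft_since) >= @grace_seconds ? :soft_limit_exceeded : :balloon
    end

    # RSS is back under the soft limit, so the balloon window resets.
    @over_soft_since = nil
    :ok
  end

  # Runs the check in a background thread every interval_seconds and hands
  # any breach to the caller, who decides how to shut Sidekiq down.
  def start(interval_seconds: 3, &on_breach)
    Thread.new do
      loop do
        status = check
        on_breach.call(status) if %i[soft_limit_exceeded hard_limit_exceeded].include?(status)
        sleep interval_seconds
      end
    end
  end
end
```

Injecting the RSS reader and the clock keeps the decision logic testable without shelling out. What happens on a breach (graceful shutdown versus an immediate kill) is deliberately left to the block passed to start, since the notes above leave that question open.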
Step 2) Dynamically calculate/adjust the Sidekiq memory limit based on the jobs currently running. This may give us an idea of whether Sidekiq RSS is healthy before it reaches HARD_LIMIT_RSS_MAX. Over time, as we gain confidence, we can kill the Sidekiq process earlier, before the problem becomes more serious.
- 2.1) Factors might include the worker class of the running jobs and their running time. For example, the expected RSS will increase when RepositoryImportWorker keeps running for a long time.
- 2.2) We may need to extend today's SidekiqMonitor to keep more information (such as the job start time). Sidekiq may already record this information.
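One way the dynamic limit in Step 2 might be computed is sketched below. Everything here is an assumption made for illustration: the allowance table, its numbers, the RunningJob struct, and the function name are hypothetical, and start times are plain seconds rather than whatever SidekiqMonitor would actually record.

```ruby
# Hypothetical sketch of a dynamic RSS limit (all names and numbers are made up).
# Per-worker memory allowance: a fixed amount plus growth per minute of runtime.
ALLOWANCES = {
  'RepositoryImportWorker' => { base_kb: 300_000, per_minute_kb: 50_000 }
}.freeze
DEFAULT_ALLOWANCE = { base_kb: 50_000, per_minute_kb: 0 }.freeze

# started_at is in seconds, on the same clock as the `now` argument below.
RunningJob = Struct.new(:worker_class, :started_at)

# Expected RSS ceiling for the current set of running jobs, never above the
# configured hard limit.
def expected_rss_limit_kb(running_jobs, now:, base_kb:, hard_limit_kb:)
  extra_kb = running_jobs.sum do |job|
    allowance = ALLOWANCES.fetch(job.worker_class, DEFAULT_ALLOWANCE)
    minutes = [(now - job.started_at) / 60.0, 0].max
    allowance[:base_kb] + (allowance[:per_minute_kb] * minutes).round
  end

  [base_kb + extra_kb, hard_limit_kb].min
end
```

With these made-up numbers, a RepositoryImportWorker that has been running for ten minutes raises a 1,000,000 KB baseline to 1,800,000 KB, and the result is always capped at the hard limit. Real allowances would have to come from observed per-worker memory behavior.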