SidekiqHandler should terminate parent process
What does this MR do and why?
Renames TermProcessHandler to SidekiqHandler, since it is used only by Sidekiq. For Sidekiq, we now use SidekiqCluster everywhere. SidekiqCluster uses ProcessSupervisor, which traps the TERM signal and forwards it to all child processes. It then waits for termination and, if a process is stuck, hard-stops it with SIGKILL.
We want to reuse this for the SidekiqHandler: instead of sending TERM to the worker's own pid, we send TERM to the parent process pid, which is SidekiqCluster.
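Conceptually, the renamed handler boils down to something like the following sketch (simplified for illustration; the exact namespacing and watchdog wiring follow the diff discussed in the review below):

```ruby
# Simplified sketch of the renamed handler: resolve the parent pid at
# initialization and TERM it instead of the worker itself.
module Gitlab
  module Memory
    class Watchdog
      class SidekiqHandler
        def initialize
          @parent_pid = Process.ppid # sidekiq-cluster, which supervises this worker
        end

        def call
          # sidekiq-cluster's ProcessSupervisor traps TERM, forwards it to all
          # child processes, and eventually SIGKILLs any worker that is stuck.
          Process.kill(:TERM, @parent_pid)
        end
      end
    end
  end
end
```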
How to set up and validate locally
This is a little tricky in a local environment, especially if you are using macOS:
- Enable Watchdog for Sidekiq
export GITLAB_MEMORY_WATCHDOG_ENABLED=true
- Set MaxRSS
export SIDEKIQ_MEMORY_KILLER_MAX_RSS=500
- When running on Darwin, Gitlab::Metrics::System.memory_usage_rss[:total] will return 0, because this module relies on the /proc filesystem being available. The workaround is to monkey-patch rss_memory_limit.rb:
```diff
diff --git a/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb b/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
index ac71592294ca..600c47e25e65 100644
--- a/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
+++ b/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
@@ -13,7 +13,7 @@ def initialize(memory_limit_bytes:)
           end

           def call
-            worker_rss_bytes = Gitlab::Metrics::System.memory_usage_rss[:total]
+            worker_rss_bytes = get_rss # Gitlab::Metrics::System.memory_usage_rss[:total]

             return { threshold_violated: false, payload: {} } if worker_rss_bytes <= memory_limit_bytes
@@ -22,6 +22,16 @@ def call

           private

+          def get_rss
+            output, status = Gitlab::Popen.popen(%W(ps -o rss= -p #{Process.pid}), Rails.root.to_s)
+
+            return 0 unless status&.zero?
+
+            output.to_i
+          end
+
           def payload(worker_rss_bytes, memory_limit_bytes)
             {
               message: 'rss memory limit exceeded',
```
- Restart and tail rails-background-jobs (e.g. gdk restart rails-background-jobs, then gdk tail rails-background-jobs)
- Monitor processes (I am using htop) and watch for sidekiq and sidekiq-cluster
- After ~10 minutes you should see the following in the logs:
{"severity":"INFO","time":"2023-01-10T15:45:09.857Z","message":"stopped","memwd_reason":"successfully handled","memwd_handler_class":"Gitlab::Memory::Watchdog::SidekiqHandler","memwd_sleep_time_s":3,"pid":36206,"worker_id":"sidekiq_0","memwd_rss_bytes":0,"retry":0}
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Relates to #387794 (closed)
Merge request reports
Activity
changed milestone to %15.8
assigned to @nmilojevic1
added devops::data stores section::core platform labels
A deleted user added backend label
Reviewer roulette
Changes that require review have been detected!
Please refer to the table below for assigning reviewers and maintainers suggested by Danger in the specified category:
| Category | Reviewer | Maintainer |
| --- | --- | --- |
| backend | Krasimir Angelov (@krasio) (UTC+13, 12 hours ahead of @nmilojevic1) | Roy Zwambag (@rzwambag) (UTC+1, same timezone as @nmilojevic1) |

To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot, based on their timezone. Feel free to override these selections if you think someone else would be better suited, or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.
If needed, you can retry the danger-review job that generated this comment.

Generated by Danger

marked the checklist item I have evaluated the MR acceptance checklist for this MR. as completed
@harsimarsandhu can you do the initial review?
Hey @nmilojevic1 I will review this today.
@nmilojevic1 I am not able to verify this locally. I did the steps mentioned in the description and waited 10 minutes; I used the command gdk tail rails-background-jobs | grep '"message":"stopped"' but it is not showing any logs.

Hi @harsimarsandhu, have you tried adding the env variables in ~/.bash_profile if you are using GDK on macOS? Can you check in the Rails console whether GITLAB_MEMORY_WATCHDOG_ENABLED and SIDEKIQ_MEMORY_KILLER_MAX_RSS are set correctly?
Also, I rebased; in !108112 (merged) we enabled some feature flags by default. Can you please try again?
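For reference, this can be checked quickly from the Rails console; a minimal sketch (the expected values assume the exports from the MR description):

```ruby
# In the Rails console (e.g. `bundle exec rails console` from the gitlab directory):
ENV['GITLAB_MEMORY_WATCHDOG_ENABLED']            # => "true" if the export was picked up
ENV['SIDEKIQ_MEMORY_KILLER_MAX_RSS']             # => "500"
Gitlab::Metrics::System.memory_usage_rss[:total] # => 0 on Darwin without the monkey patch above
```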
requested review from @harsimarsandhu
added 2002 commits
- 44c8645e...045f039e - 2000 commits from branch master
- 1acd3f7b - SidekiqHandler should terminate parent process
- 98e95be1 - Fix rubocop offences
@mkaeppler, since you are familiar with this change, do you have the capacity to review this as a maintainer?
requested review from @mkaeppler
changed milestone to %15.9
Allure report
allure-report-publisher generated test report!

e2e-package-and-test: test report for 98e95be1
```
+------------------------------------------------------------------+
|                          suites summary                           |
+-------------+--------+--------+---------+-------+-------+--------+
|             | passed | failed | skipped | flaky | total | result |
+-------------+--------+--------+---------+-------+-------+--------+
| Data Stores | 11     | 0      | 3       | 0     | 14    | ✅     |
+-------------+--------+--------+---------+-------+-------+--------+
| Total       | 11     | 0      | 3       | 0     | 14    | ✅     |
+-------------+--------+--------+---------+-------+-------+--------+
```
@harsimarsandhu, thanks for approving this merge request.

This is the first time the merge request is approved. To ensure full test coverage, a new pipeline will be started shortly.
added pipeline:mr-approved label
removed review request for @harsimarsandhu
(Commenting on # This handler sends SIGTERM and considers the situation handled.)

@nmilojevic1 had to remind me why exactly we're doing this (i.e. TERM-ing the parent), and I think it's worth documenting that this now couples application logic with how we deploy and supervise Sidekiq instances:
We want to avoid situations where, after a TERM, the worker gets stuck and unresponsive. For Puma, this is behavior we get for free, since the PumaHandler uses Puma-internal functionality to eventually SIGKILL a worker that is unresponsive. For Sidekiq, we rely on sidekiq-cluster for this, which runs a similar supervision loop, but this also means that the memory watchdog now assumes that this will continue to be the case. If we were to run Sidekiq processes outside of sidekiq-cluster, we would suddenly TERM a different/unknown parent process. So this presents a coupling problem. Let's document this in code for now, something like:
```diff
-      # This handler sends SIGTERM and considers the situation handled.
+      # This handler sends SIGTERM to the worker's parent process. We TERM the parent instead of the worker
+      # because we assume that the Sidekiq worker is supervised by sidekiq-cluster, which traps TERMs
+      # and restarts the entire process group and sends KILL to unresponsive workers eventually.
+      # This means that if we ever run workers outside of sidekiq-cluster, we need to update this handler
+      # to _not_ kill the parent anymore, but the worker itself.
```
The corresponding change to the handler:

```diff
       # This handler sends SIGTERM and considers the situation handled.
-      class TermProcessHandler
-        def initialize(pid = $$)
-          @pid = pid
+      class SidekiqHandler
+        def initialize
+          @parent_pid = Process.ppid
         end

         def call
-          Process.kill(:TERM, @pid)
+          # We are sending the TERM signal to the parent process, because we are using SidekiqCluster.
+          # The ProcessSupervisor will trap the TERM signal and send it to all child processes.
+          # It will wait for termination, and if a process is stuck, it will hard stop/kill it.
+          Process.kill(:TERM, @parent_pid)
```

Thanks, @mkaeppler. After discussing in Slack, we want to avoid making things worse for self-managed customers. Although currently, compared to the SidekiqMemoryKiller, the net result would be the same (SidekiqCluster would eventually terminate all processes if one of them is terminated), we found it odd that, for a self-managed customer, SidekiqCluster would terminate all workers when only one of them is running out of RSS. I will mimic the SidekiqMemoryKiller behavior to have a clean cut-off, and investigate why SidekiqCluster kills all the workers.
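For illustration only, a rough sketch of what "mimicking the SidekiqMemoryKiller behavior" could look like; this is a hypothetical outline with made-up timeout values, not the code that was eventually merged:

```ruby
# Hypothetical sketch: terminate only the affected worker and escalate to
# SIGKILL if it has not exited within a grace period.
class SidekiqHandler
  def initialize(shutdown_timeout_s: 25, kill_grace_time_s: 5)
    @shutdown_timeout_s = shutdown_timeout_s
    @kill_grace_time_s = kill_grace_time_s
    @pid = $$
  end

  def call
    Process.kill(:TERM, @pid)                       # ask Sidekiq to shut down gracefully
    sleep(@shutdown_timeout_s + @kill_grace_time_s) # allow in-flight jobs to finish
    Process.kill(:KILL, @pid)                       # hard stop if the worker is still alive
    true
  end
end
```

The point of the clean cut-off is that only the worker exceeding its RSS limit gets restarted, rather than the whole process group.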
mentioned in issue #387794 (closed)
reset approvals from @mkaeppler and @harsimarsandhu by pushing to the branch