
SidekiqHandler should terminate parent process

Closed Nikola Milojevic requested to merge sidekiq_handler_should_terminate_parent_process into master
4 unresolved threads

What does this MR do and why?

Renames TermProcessHandler to SidekiqHandler, since it is used only by Sidekiq. For Sidekiq, we now use SidekiqCluster everywhere. SidekiqCluster uses ProcessSupervisor, which traps the TERM signal and forwards it to all child processes. It waits for them to terminate and, if a process is stuck, hard-stops it with SIGKILL.

We want to reuse this behavior for the Sidekiq handler. So instead of sending TERM to the worker's own pid, we send TERM to the parent process pid, which is SidekiqCluster.
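
To illustrate the pattern described above, here is a minimal, self-contained Ruby sketch of a trap-forward-wait-kill supervisor. This is not GitLab's actual ProcessSupervisor; the two dummy children and the 10-second grace period are arbitrary choices for the sketch.

# Minimal illustration of the supervision pattern described above; not
# GitLab's actual ProcessSupervisor. The parent traps TERM, forwards it to
# its children, waits for a grace period, then SIGKILLs any stragglers.
child_pids = 2.times.map do
  fork do
    trap('TERM') { exit }   # a well-behaved child exits on TERM
    sleep                   # otherwise it just idles forever
  end
end

shutdown = false
trap('TERM') { shutdown = true }  # only flip a flag inside the trap context

sleep 0.1 until shutdown

deadline = Time.now + 10          # arbitrary grace period for this sketch
child_pids.each do |pid|
  begin
    Process.kill('TERM', pid)
    loop do
      break if Process.waitpid(pid, Process::WNOHANG)  # child exited cleanly
      if Time.now > deadline
        Process.kill('KILL', pid)                      # hard stop a stuck child
        Process.waitpid(pid)
        break
      end
      sleep 0.1
    end
  rescue Errno::ESRCH, Errno::ECHILD
    # child already gone
  end
end

Running the script and sending kill -TERM <parent pid> from another shell shows the forwarding; removing the child's own trap would exercise the SIGKILL path.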

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

This is a little tricky in a local environment, especially if you are using macOS:

  1. Enable the memory watchdog for Sidekiq: export GITLAB_MEMORY_WATCHDOG_ENABLED=true
  2. Set the maximum RSS: export SIDEKIQ_MEMORY_KILLER_MAX_RSS=500
  3. When running on Darwin, Gitlab::Metrics::System.memory_usage_rss[:total] will return 0, because this module relies on the /proc filesystem being available. The workaround is to monkey-patch rss_memory_limit.rb (a quick standalone check of the ps-based reading is sketched after this list):
diff --git a/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb b/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
index ac71592294ca..600c47e25e65 100644
--- a/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
+++ b/lib/gitlab/memory/watchdog/monitor/rss_memory_limit.rb
@@ -13,7 +13,7 @@ def initialize(memory_limit_bytes:)
           end
 
           def call
-            worker_rss_bytes = Gitlab::Metrics::System.memory_usage_rss[:total]
+            worker_rss_bytes = get_rss # Gitlab::Metrics::System.memory_usage_rss[:total]
 
             return { threshold_violated: false, payload: {} } if worker_rss_bytes <= memory_limit_bytes
 
@@ -22,6 +22,16 @@ def call
 
           private
 
+          def get_rss
+            output, status = Gitlab::Popen.popen(%W(ps -o rss= -p #{Process.pid}), Rails.root.to_s)
+        
+            return 0 unless status&.zero?
+
+            output.to_i
+          end
+
           def payload(worker_rss_bytes, memory_limit_bytes)
             {
               message: 'rss memory limit exceeded',
  4. Restart and tail rails-background-jobs
  5. Monitor processes (I am using htop) and watch for sidekiq and sidekiq-cluster
  6. After ~10 minutes you should see in the logs:
{"severity":"INFO","time":"2023-01-10T15:45:09.857Z","message":"stopped","memwd_reason":"successfully handled","memwd_handler_class":"Gitlab::Memory::Watchdog::SidekiqHandler","memwd_sleep_time_s":3,"pid":36206,"worker_id":"sidekiq_0","memwd_rss_bytes":0,"retry":0}

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Relates to #387794 (closed)

Edited by Nikola Milojevic

Merge request reports

Activity

22 22 end
23 23
24 24 # This handler sends SIGTERM and considers the situation handled.
  • @nmilojevic1 had to remind me why exactly we're doing this (i.e. TERM-ing the parent), and I think it's worth documenting that this now couples application logic with how we deploy and supervise Sidekiq instances:

    We want to avoid situations where, after a TERM, the worker gets stuck and becomes unresponsive. For Puma, this is behavior we get for free, since the PumaHandler uses Puma-internal functionality to eventually SIGKILL a worker that is unresponsive. For Sidekiq, we rely on sidekiq-cluster for this, which runs a similar supervision loop, but this also means that the memory watchdog now assumes this will continue to be the case. If we were to run Sidekiq processes outside of sidekiq-cluster, we would suddenly TERM a different/unknown parent process. So this presents a coupling problem.

    Let's document this in code for now, something like:

    Suggested change
    24 # This handler sends SIGTERM and considers the situation handled.
    24 # This handler sends SIGTERM to the worker's parent process. We TERM the parent instead of the worker
    25 # because we assume that the Sidekiq worker is supervised by sidekiq-cluster, which traps TERMs
    26 # and restarts the entire process group and sends KILL to unresponsive workers eventually.
    27 # This means that if we ever run workers outside of sidekiq-cluster, we need to update this handler
    28 # to _not_ kill the parent anymore, but the worker itself.
  • 23 23
    24 24 # This handler sends SIGTERM and considers the situation handled.
    25 class TermProcessHandler
    26 def initialize(pid = $$)
    27 @pid = pid
    25 class SidekiqHandler
    26 def initialize
    27 @parent_pid = Process.ppid
    28 28 end
    29 29
    30 30 def call
    31 Process.kill(:TERM, @pid)
    31 # We are sending TERM signal to the parent process, because we are using the Sidekiq Cluster
    32 # The ProcessSupervisor will trap the Term signal, and send it to all child processes.
    33 # It will wait for termination, and if process is stuck, it will hard stop/kill it.
    34 Process.kill(:TERM, @parent_pid)
  • Looks good, thanks!

  • Matthias Käppler approved this merge request

  • Thanks, @mkaeppler. After discussing in Slack, we want to avoid making things worse for self-managed customers. Although, compared to the current SidekiqMemoryKiller, the net result would be the same, since SidekiqCluster would eventually terminate all processes once one of them is terminated, we found it odd that (for a self-managed customer) SidekiqCluster would terminate all workers when only one of them runs out of RSS. I will mimic the SidekiqMemoryKiller behavior to get a clean cut-off, and investigate why SidekiqCluster kills all the workers.

  • mentioned in issue #387794 (closed)

  • Kev Kloss reset approvals from @mkaeppler and @harsimarsandhu by pushing to the branch
