Create separate daemon process for Sidekiq Memory Killer monitoring
Problem to solve
In issue #34547 (closed), a user reported a scenario in which the Sidekiq Memory Killer failed to kill the Sidekiq process as expected:
- The Sidekiq memory killer thread detected high memory usage.
- The Sidekiq memory killer thread sent SIGTSTP and SIGTERM to the Sidekiq worker process.
- The Sidekiq worker process received SIGTSTP and SIGTERM.
- However, something went wrong: the Sidekiq worker process did not terminate, and the memory killer thread did not send SIGKILL as the last resort. (We are not sure why SIGKILL was not sent: maybe the memory killer thread was terminated, or maybe it was never scheduled or woken up again.)
There are two issues here:
- It is hard to debug. It is not easy to get much information from a user's production environment for root cause analysis, and we cannot reproduce it either.
- The memory killer thread is not able to send SIGKILL to the Sidekiq worker.
The second issue most likely arises because the Sidekiq memory killer thread is itself a child thread of the Sidekiq worker process. So when something goes wrong in the main thread, the child thread may not be able to act as the last resort.
The idea is to fork a new child process to send SIGKILL to the Sidekiq worker. If this forked child process can run independently of the Sidekiq worker, it will make sure SIGKILL is sent when the Sidekiq worker hangs for any reason.
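A minimal Ruby sketch of this idea follows. It is not the actual implementation: the names `process_alive?` and `spawn_kill_watchdog` and the `grace_seconds` parameter are hypothetical, and whether such a child really survives inside Sidekiq is still something to verify.

```ruby
# Hypothetical sketch of a forked watchdog; all names are illustrative.

# Returns true while a process with this pid exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 checks for existence without sending a signal
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Forks a child that sends SIGKILL to target_pid if the target is still
# alive once grace_seconds have passed. Returns the watchdog's pid.
def spawn_kill_watchdog(target_pid, grace_seconds)
  watchdog_pid = fork do
    # Start a new session so that signals sent to the parent's process
    # group do not reach the watchdog.
    Process.setsid
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace_seconds
    while process_alive?(target_pid)
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        Process.kill('KILL', target_pid)
        break
      end
      sleep 0.1
    end
    # Skip at_exit handlers inherited from the parent process.
    exit!(0)
  end
  Process.detach(watchdog_pid) # reap the watchdog so it does not linger as a zombie
  watchdog_pid
end
```

In this sketch the memory killer thread would call something like `spawn_kill_watchdog(Process.pid, 30)` just before sending SIGTERM, so the last-resort SIGKILL no longer depends on any thread inside the worker being scheduled. The 30 second grace period is only an example value.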
Intended users
Further details
This is to help developers debug subtle scenarios in the Sidekiq memory killer. Customers should not need to care about this tool.
As a possible future direction, this tool could be generalised into a diagnostic information collector for any or all GitLab packages. It could make it easier for us to support our users.
Proposal
There are several possible options to create the new monitor process:
- Create a new script, sidekiq_monitor, that runs in parallel with Sidekiq.
  - Advantage: this is definitely an independent daemon process that won't be impacted by Sidekiq.
  - Disadvantage: it requires inter-process communication to learn the Sidekiq worker process ID, etc.
- Fork a new process from the existing Sidekiq memory killer thread.
  - Advantage: less code change; easy to retrieve Sidekiq worker process information.
  - Disadvantage: whether this approach works depends on the Sidekiq implementation. We are not sure whether Sidekiq terminates all of its child processes, so this needs research and experimentation.
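For the first option, one simple form of inter-process communication would be a pidfile. The sketch below assumes the Sidekiq worker writes its process ID to a file that the monitor script can read; the function name and the idea of a pidfile are assumptions for illustration, not something this proposal has settled on.

```ruby
# Hypothetical helper for a standalone monitor script.
# Returns the worker pid stored in pidfile_path, or nil if the file is
# missing or does not contain a decimal integer.
def read_worker_pid(pidfile_path)
  Integer(File.read(pidfile_path).strip, 10)
rescue Errno::ENOENT, ArgumentError
  nil
end
```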
Besides sending SIGKILL to the Sidekiq worker process, the new monitor process will also collect context (such as ps output) for root cause analysis and log it. When a user encounters an issue, we can ask them to provide this log.
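As a rough illustration of the context collection, the sketch below captures `ps` output for one process and writes it as a single JSON log line. The function names, the chosen `ps` columns, and the log format are all assumptions; the real tool may collect different data.

```ruby
require 'json'
require 'open3'
require 'time'

# Columns requested from ps; an illustrative choice, not a fixed list.
PS_COLUMNS = 'pid,ppid,stat,rss,etime,command'

# Returns a Hash describing the process at the moment of the call.
def collect_process_context(pid)
  ps_output =
    begin
      output, _status = Open3.capture2e('ps', '-o', PS_COLUMNS, '-p', pid.to_s)
      output.scrub # replace invalid bytes so the text can be serialised as JSON
    rescue SystemCallError => e
      "ps unavailable: #{e.message}"
    end

  { 'time' => Time.now.utc.iso8601, 'pid' => pid, 'ps' => ps_output }
end

# Writes the collected context to io as one JSON line.
def log_process_context(pid, io)
  io.puts(JSON.generate(collect_process_context(pid)))
end
```

The monitor could call `log_process_context(worker_pid, log_file)` right before sending SIGKILL, so the log shows the state the worker was stuck in.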
Documentation
Testing
This is a low-risk change. In the worst case, the monitor process is terminated by the Sidekiq worker, and the Sidekiq memory killer behaves the same as the current version.
So far, we have not replicated such a scenario ourselves. But I think we can verify it by:
- Try to hook a long-running at_exit handler in Sidekiq, to see whether we can reproduce this behaviour.
- Manually send SIGTERM to Sidekiq and observe that the Sidekiq worker process terminates, but the new monitor process should still be alive and working normally.
- Manually send SIGKILL to Sidekiq and observe that the Sidekiq worker process terminates, but the new monitor process should still be alive and working normally.
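The first check could be prototyped outside Sidekiq with a plain Ruby process standing in for the worker. In the sketch below, `spawn_hanging_on_exit` is a hypothetical helper: it forks a child whose at_exit handler sleeps, so the child keeps running after SIGTERM and only SIGKILL ends it. Real Sidekiq installs its own signal handling, so this only approximates the reported hang.

```ruby
# Hypothetical stand-in for a worker that hangs during shutdown.
# Forks a child with an at_exit handler that sleeps for hang_seconds and
# returns the child's pid once that handler has been registered.
def spawn_hanging_on_exit(hang_seconds)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    at_exit { sleep hang_seconds } # simulates a shutdown hook that never finishes
    writer.write('ready')
    writer.close
    sleep # idle until a signal arrives
  end
  writer.close
  reader.read # returns once the child has closed its end of the pipe
  reader.close
  pid
end
```

Sending SIGTERM to the returned pid should leave the process running inside its at_exit handler, which is the situation the new monitor process is meant to resolve with SIGKILL.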
What does success look like, and how can we measure that?
If the Sidekiq worker process hangs after it receives SIGTERM, the new monitor process will send SIGKILL to kill the Sidekiq worker process.