Remove unsafe any_jobs check

What does this MR do and why?

Removes the any_jobs? check from the signal_and_wait function, which is used to restart sidekiq when memory usage goes above a specific limit.

The restart_sidekiq function calls signal_and_wait three times, sending increasingly forceful signals to the sidekiq process. However, signal_and_wait only sleeps while any_jobs? is true. When there are no jobs, the code falls straight through to a kill -9 before the first signal has been handled.
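The fall-through described above can be illustrated with a small simulation. This is a minimal sketch, not the GitLab source: the class name FakeMemoryKiller, the check_jobs flag, the 3-second grace periods, and the TSTP/TERM/KILL sequence are all assumptions made for illustration. No real signals are sent, and a counter stands in for the clock.

```ruby
# Hypothetical, simplified model of the restart sequence; not the GitLab source.
# No real signals are sent and no real sleeping happens.
class FakeMemoryKiller
  attr_reader :events

  # any_jobs:   whether the simulated process still has jobs running
  # check_jobs: true models the wait loop with the any_jobs? check,
  #             false models the loop with the check removed
  def initialize(any_jobs:, check_jobs:)
    @any_jobs = any_jobs
    @check_jobs = check_jobs
    @events = []
    @clock = 0 # simulated seconds
  end

  # Assumed escalation order and grace periods, for illustration only.
  def restart_sidekiq
    signal_and_wait(3, :TSTP)
    signal_and_wait(3, :TERM)
    signal_and_wait(3, :KILL)
  end

  private

  def signal_and_wait(grace, signal)
    @events << [signal, @clock] # stands in for Process.kill(signal, pid)
    deadline = @clock + grace
    # Each increment stands in for one sleep of the check interval.
    @clock += 1 while keep_waiting? && @clock < deadline
  end

  def keep_waiting?
    @check_jobs ? @any_jobs : true
  end
end

with_check = FakeMemoryKiller.new(any_jobs: false, check_jobs: true)
with_check.restart_sidekiq
p with_check.events    # => [[:TSTP, 0], [:TERM, 0], [:KILL, 0]]

without_check = FakeMemoryKiller.new(any_jobs: false, check_jobs: false)
without_check.restart_sidekiq
p without_check.events # => [[:TSTP, 0], [:TERM, 3], [:KILL, 6]]
```

With the check in place and no jobs running, all three signals are recorded at simulated time 0, so the process gets no time to react before the KILL. With the check removed, each signal is followed by its full grace period.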

We came across this in our logs a couple of months back while diagnosing a problem where a resource group got stuck. We tracked it down to an issue with this job. In the logs we would see it being deduplicated as a duplicate of a job that we had no record of.

As we had a bad memory limit set (too low - 1G), we saw sidekiq restart every 15 minutes, and we would see a resource group get stuck nearly every week.

We concluded that sidekiq was in the process of accepting a job when it was killed by the memory killer.

This MR represents the fix we have monkey patched on a 14.9 self-hosted install, and it has been stable now for a couple of months.

Upon revisiting this issue, it might simply be the case that the && should become ||, but this MR is what we have running and could serve as a conversation starter.

How to set up and validate locally

We were unable to prove this locally (probably due to lack of knowledge), but as mentioned above, this is the fix we have working in a production environment. The exact patch for 14.9 is shared as an attachment, memory_killer.patch.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #381139 (closed)

Edited by Alastair McClelland
