More reliable Sidekiq queue

Sidekiq uses BRPOP command to pop a job off the queue in Redis. It means that job is removed from the queue, usually, that's not a problem because if our job raises an error the Sidekiq puts it back with "failed" status(to another queue) and it will be retried later. But it becomes a problem when we kill sidekiq or when it segfaults or crashes. In this case, we lose that job forever.

This made me to take a look at how we kill our sidekiq processes. We use https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/sidekiq_middleware/memory_killer.rb for that. There is a variable SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT which means how long we wait from stopping accepting new jobs until we kill the process. So we give a job at maximum 30 seconds to finish its stuff. After this, we will kill that job and we'll never retry it (see above the reason). I think there is no reason to be so aggressive because sometimes our jobs exceed that time http://performance.gitlab.net/dashboard/db/sidekiq-workers. I propose to set it to 90 seconds.

The second optimization would be to increase the value of SIDEKIQ_MEMORY_KILLER_GRACE_TIME, which now is 15 minutes. That means that we wait 15 minutes after we write a warning to the log this thread will shut down PID #{Process.pid} - Worker #{worker.class} - JID-# in #{GRACE_TIME} seconds. Basically, with the current implementation, I don't see any reason to keep this fat process running. For now, I propose to set it to several seconds. Later, we can improve it by adding one more get_rss call after GRACE_TIME to make sure that memory was flooded by memory leaks, not by one fat job that can free resources afterward. That will make our shots more accurate. But I don't think the last one will have practical benefits in real life.

Summary

Now we have the following picture: process exceeds memory limit -> we write warning to the log -> wait 15 minutes -> stop accepting new jobs -> wait 30 seconds -> shutting everything down.

I propose to set SHUTDOWN_WAIT to 90 seconds (default 30) and GRACE_TIME to 10 seconds(now 15 minutes). What to expect?

Decreasing memory consumption by not keeping fat processes 15 minutes before killing.
More reliable queue by letting job more time to finish its stuff.

It should close issue https://gitlab.com/gitlab-org/gitlab-ce/issues/23646. And I'm also sure that everyone saw lost jobs, at least I saw many times.

/cc @jacobvosmaer-gitlab @pcarranza @jnijhof

REFERENCES:

https://github.com/mperham/sidekiq/wiki/Error-Handling

https://github.com/mperham/sidekiq/wiki/Reliability

Edited Sep 10, 2018 by Valery Sizov