Sidekiq memory limits and memory killer in Kubernetes
After https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10930 I've been thinking about the Sidekiq memory killer in Kubernetes. I accept that in Kubernetes the native memory limits are the One True Way to kill off pods that have exceeded their allocation, and I see from gitlab-org/charts/gitlab#1692 that the memory killer has been tricksy in Kubernetes in the past.
However, terminating a pod for exceeding the memory limit sends a SIGKILL (immediate termination), which gives Sidekiq no opportunity to let jobs complete or requeue them. When this happens, interrupted jobs only get picked up again once the reliable fetcher cleanup runs (hourly by default, which was part of the issue in 10930). We've currently set the memory limits for the Sidekiq pods quite a bit higher (3-8GB) than the old Sidekiq memory killer limit (~2GB), so in practice I don't expect Sidekiq to get cgroup OOM-killed all that often without the memory killer, especially once the current PagesWorker memory problem is fixed, but it still makes me a bit twitchy. Even if we made the reliable fetcher cleanup run more often (every few minutes?), that could be a long time to wait for an urgent job to get re-scheduled, with a customer waiting on the UI to update and getting impatient. And I acknowledge that our preferred scenario is that Sidekiq never leaks or gets OOM-killed, but we don't always get what we want.
I think we need to consider making the container wrapper script smarter and able to restart cleanly, then enabling one of the memory killers again (legacy or daemon; not sure which yet, but possibly the smarter daemon one). Its limits should be configured sufficiently below the actual k8s memory limits for each shard that, if something gets a bit out of hand or there's a gradual leak, we get a clean shutdown with jobs re-queued rather than a SIGKILL. We should also monitor for this happening and alert if it starts occurring "too often".
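To make the proposal a bit more concrete, here is a rough Python sketch of the two pieces of logic involved: deriving a memory-killer threshold from the k8s limit, and deciding when restarts count as "too often". Everything in it is an assumption for illustration only: the names `soft_limit_bytes`, `should_shut_down` and `RestartRateMonitor`, the 20% default headroom, and the "more than N restarts in a window" rule are all hypothetical, and none of this is how the existing legacy or daemon memory killer is implemented.

```python
from collections import deque

GIB = 1024 ** 3


def soft_limit_bytes(k8s_limit_bytes: int, headroom: float = 0.2) -> int:
    """Hypothetical memory-killer threshold: a fixed fraction below the pod limit."""
    if not 0 < headroom < 1:
        raise ValueError("headroom must be between 0 and 1 (exclusive)")
    return int(k8s_limit_bytes * (1 - headroom))


def should_shut_down(rss_bytes: int, k8s_limit_bytes: int, headroom: float = 0.2) -> bool:
    """True once RSS reaches the soft limit, i.e. before the cgroup OOM kill would hit."""
    return rss_bytes >= soft_limit_bytes(k8s_limit_bytes, headroom)


class RestartRateMonitor:
    """Tracks memory-killer restarts and flags when they happen "too often".

    "Too often" is assumed here to mean more than `max_restarts`
    restarts within the last `window_seconds` seconds.
    """

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, now: float) -> bool:
        """Record a restart at time `now`; return True if an alert should fire."""
        self._events.append(now)
        # Drop restarts that have fallen out of the sliding window.
        while self._events and self._events[0] < now - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_restarts
```

For example, with a 4GiB pod limit and 25% headroom the soft limit would be 3GiB, so a worker at 3.5GiB would be asked to shut down cleanly well before the cgroup limit is reached. The right headroom per shard would need to be tuned against how quickly memory actually grows during the shutdown grace period.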
@skarbek You've been knee-deep in this for a while; what do you think?