Identify and Kill Zombie Sidekiq Jobs
Background
There are scenarios where a Sidekiq job, because of its scope, consumes so many resources that it takes Sidekiq down with it without ever completing. When Sidekiq restarts, the job is retried, gets picked up again, and runs into the same issue. This can repeat indefinitely, which is not a situation we want to be in. @andrewn reported it via: https://gitlab.slack.com/archives/CB3LSMEJV/p1563275105393100?thread_ts=1563271613.386500&cid=CB3LSMEJV and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7262. Within the scope of this issue, let's call these jobs "zombie" jobs. (Feel free to suggest an alternative name :) )
Goal
The goal of this issue is for SRE:On-call to run a prepared query, identify the list of such zombie jobs, and kill them at a set interval until the DEV team addresses https://gitlab.com/gitlab-org/gitlab-ce/issues/35389
Query
@andrewn - to provide ELK queries here.
Kill the jobs
TBD (need to find out how to kill the jobs)
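Until the actual procedure is confirmed, here is one possible starting point: a minimal sketch, not a vetted runbook. It assumes the query above yields a list of job IDs (JIDs) and that the zombie jobs are sitting in a set that Sidekiq's API exposes, such as Sidekiq::RetrySet or a Sidekiq::Queue. The helper name delete_zombie_jobs and its dry_run flag are hypothetical and only for illustration. A job that is already executing is not in those sets, so this sketch would not stop it.

```ruby
require "set"

# Hypothetical helper: deletes every entry in job_set whose JID is in zombie_jids.
# job_set: any enumerable of entries responding to #jid, #klass and #delete,
#          e.g. Sidekiq::RetrySet.new or Sidekiq::Queue.new("some_queue").
# dry_run: when true (the default), only report matches and delete nothing.
# Returns the list of matched JIDs.
def delete_zombie_jobs(job_set, zombie_jids, dry_run: true)
  wanted = zombie_jids.to_set
  # Collect matches first so nothing is deleted while still iterating the set.
  matches = job_set.select { |job| wanted.include?(job.jid) }
  matches.each do |job|
    puts "#{dry_run ? 'would delete' : 'deleting'} #{job.klass} jid=#{job.jid}"
    job.delete unless dry_run
  end
  matches.map(&:jid)
end

# Usage on a console with the sidekiq gem loaded (JID is a placeholder):
#   require "sidekiq/api"
#   delete_zombie_jobs(Sidekiq::RetrySet.new, ["<jid from the query>"])
#   delete_zombie_jobs(Sidekiq::RetrySet.new, ["<jid from the query>"], dry_run: false)
```

Defaulting to a dry run is a deliberate choice here: the on-call can compare the reported class names and JIDs against the query results before anything is removed.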