Sidekiq stuck after SIGTSTP

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Gitlab CE 12.3.4, running on k8s with helm chart install.

I have two sidekiq instances, both having the same issue a couple days apart. I'm not sure why the sidekiq processes received the SIGTSTP in the first place, but clearly sidekiq is unable to terminate.

2019-12-31T03:22:44.446Z 8 TID-ouo9fdg70 PipelineProcessWorker JID-18078f7f797961262cb526ea INFO: done: 0.605 sec
2019-12-31T03:22:44.621Z 8 TID-ouo6ncsqk Namespaces::ScheduleAggregationWorker JID-b31fbe5e2e13dc66b0613ba4 INFO: start
2019-12-31T03:22:44.647Z 8 TID-ouo6ncsqk Namespaces::ScheduleAggregationWorker JID-b31fbe5e2e13dc66b0613ba4 INFO: done: 0.026 sec
2019-12-31T03:22:44.653Z 8 TID-ountdduew ExpirePipelineCacheWorker JID-88b0a4c9ce673c406cb953ff INFO: done: 0.315 sec
2019-12-31T03:22:44.764Z 8 TID-grfv42j8k ArchiveTraceWorker JID-8fa2975fa5d6c01a8f016f47 INFO: done: 0.431 sec
2019-12-31T03:22:44.837Z 8 TID-grftc8wwc PipelineNotificationWorker JID-2591ac4029e5e4bc1611e0e3 INFO: done: 0.42 sec
2019-12-31T03:23:36.755Z 8 TID-ounti7pw4 ProjectCacheWorker JID-42e7b36039f0277d536e0a8a INFO: start
Updating statistics for project 4
2019-12-31T03:23:37.588Z 8 TID-ounti7pw4 ProjectCacheWorker JID-42e7b36039f0277d536e0a8a INFO: done: 0.834 sec
2019-12-31T03:23:37.969Z 8 TID-ouo6ncsqk GitGarbageCollectWorker JID-a5fe7bf2cf636d6af84e1700 INFO: start
2019-12-31T03:23:38.274Z 8 TID-ouo6ncsqk GitGarbageCollectWorker JID-a5fe7bf2cf636d6af84e1700 INFO: done: 0.305 sec
2019-12-31T03:25:07.671Z 8 TID-ouog3qd48 PipelineScheduleWorker JID-f2ae9f946db538da575c53dd INFO: start
2019-12-31T03:25:07.683Z 8 TID-ouog3qd48 PipelineScheduleWorker JID-f2ae9f946db538da575c53dd INFO: done: 0.013 sec
2019-12-31T03:26:20.254Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTSTP", :message=>"sending Sidekiq worker PID-8 SIGTSTP (stop fetching new jobs)"}
2019-12-31T03:26:20.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTERM", :message=>"waiting 30 seconds before sending Sidekiq worker PID-8 SIGTERM (gracefully shut down)"}
2019-12-31T03:26:20.308Z 8 TID-grfsd9jtg INFO: Received TSTP, no longer accepting new work
2019-12-31T03:26:20.308Z 8 TID-grfv4eer4 INFO: Scheduler exiting...
2019-12-31T03:26:20.308Z 8 TID-grfsd9jtg INFO: Terminating quiet workers
2019-12-31T03:26:20.309Z 8 TID-grfuao3fs INFO: Scheduler exiting...
2019-12-31T03:26:50.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTERM", :message=>"sending Sidekiq worker PID-8 SIGTERM (gracefully shut down)"}
2019-12-31T03:26:50.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGKILL", :message=>"waiting 7 seconds before sending Sidekiq worker PID-8 SIGKILL (die)"}
2019-12-31T03:26:50.291Z 8 TID-grfsd9jtg INFO: Shutting down
2019-12-31T03:26:50.791Z 8 TID-grfsd9jtg INFO: Pausing to allow workers to finish...
2019-12-31T03:26:54.813Z 8 TID-grfsd9jtg WARN: Terminating 7 busy worker threads
2019-12-31T03:26:54.814Z 8 TID-grfsd9jtg WARN: Work still in progress [#<struct Sidekiq::BaseReliableFetch::UnitOfWork queue="queue:update_merge_requests", job="{\"class\":\"UpdateMergeRequestsWorker\",\"args\" <snip>
2019-12-31T03:26:54.825Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 4bb93c97188e306e12cfa969 back to queue queue:update_merge_requests", :jid=>"4bb93c97188e306e12cfa969", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.827Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 85e1d03215b6167c961007b6 back to queue queue:update_merge_requests", :jid=>"85e1d03215b6167c961007b6", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.828Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job f34c07b6bde48d99f129c1db back to queue queue:update_merge_requests", :jid=>"f34c07b6bde48d99f129c1db", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.829Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 5add8856dc4c7f493a4f640c back to queue queue:update_merge_requests", :jid=>"5add8856dc4c7f493a4f640c", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.831Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 693a321ac557b4cef774a380 back to queue queue:update_merge_requests", :jid=>"693a321ac557b4cef774a380", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.832Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 50e3d07d3fc913a9d8324c6f back to queue queue:update_merge_requests", :jid=>"50e3d07d3fc913a9d8324c6f", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.833Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job fdc8cc1e067f361652e539e8 back to queue queue:update_merge_requests", :jid=>"fdc8cc1e067f361652e539e8", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.849Z 8 TID-grfsd9jtg INFO: Bye!

The host process

1000      1799  0.0  0.0   4284     0 ?        Ss    2019   0:00  |   \_ /bin/sh -c /scripts/process-wrapper
1000      1826  0.0  0.0  19784     4 ?        S     2019   0:00  |       \_ /bin/bash /scripts/process-wrapper
1000      1827 91.6 12.5 7498768 4111964 ?     Rl    2019 34299:45  |           \_ sidekiq 5.2.7 gitlab [7 of 25 busy] stopping

Strace shows:

strace: Process 1827 attached with 11 threads
[pid  9043] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  4065] futex(0x7f881121a924, FUTEX_WAIT_PRIVATE, 14473590, NULL <unfinished ...>
[pid  4062] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid  4061] futex(0x7f87d803dcac, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid  4060] futex(0x7f87d8a2df4c, FUTEX_WAIT_PRIVATE, 803, NULL <unfinished ...>
[pid  2311] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  2292] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  2291] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  2290] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  2289] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1827] read(3,  <unfinished ...>
[pid  4065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  1827] <... read resumed> "\1\0\0\0\0\0\0\0", 8) = 8
[pid  4065] futex(0x7f881121a924, FUTEX_WAIT_PRIVATE, 14473614, NULL <unfinished ...>
[pid  1827] getpid()                    = 8
[pid  1827] tgkill(8, 19548, SIGVTALRM <unfinished ...>
[pid  9043] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  9043] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  9043] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  9043] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] tgkill(8, 43, SIGVTALRM <unfinished ...>
[pid  9043] poll([{fd=103, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  2311] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  2311] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  2311] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  2311] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] tgkill(8, 24, SIGVTALRM <unfinished ...>
[pid  2311] poll([{fd=106, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  2292] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  2292] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  2292] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  2292] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] tgkill(8, 23, SIGVTALRM <unfinished ...>
[pid  2292] poll([{fd=113, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  2291] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  2291] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  2291] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  2291] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] tgkill(8, 22, SIGVTALRM <unfinished ...>
[pid  2291] poll([{fd=92, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  2290] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  2290] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  2290] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  2290] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] tgkill(8, 21, SIGVTALRM <unfinished ...>
[pid  2290] poll([{fd=98, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  2289] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  2289] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] write(3, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid  2289] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... write resumed> )       = 8
[pid  2289] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid  1827] timer_settime(0, 0, {it_interval={0, 100000000}, it_value={0, 100000000}},  <unfinished ...>
[pid  2289] poll([{fd=94, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid  1827] <... timer_settime resumed> NULL) = 0
[pid  1827] timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
[pid  1827] sched_yield()               = 0
[pid  1827] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  1827] getpid()                    = 8
[pid  1827] tgkill(8, 19548, SIGVTALRM <unfinished ...>
[pid  9043] <... poll resumed> )        = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid  1827] <... tgkill resumed> )      = 0
[pid  9043] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid  1827] getpid( <unfinished ...>
[pid  9043] rt_sigreturn({mask=[]} <unfinished ...>
[pid  1827] <... getpid resumed> )      = 8
[pid  9043] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
Edited by 🤖 GitLab Bot 🤖