Sidekiq stuck after SIGTSTP
Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.
Gitlab CE 12.3.4, running on k8s with helm chart install.
I have two sidekiq instances, both having the same issue a couple days apart. I'm not sure why the sidekiq processes received the SIGTSTP in the first place, but clearly sidekiq is unable to terminate.
2019-12-31T03:22:44.446Z 8 TID-ouo9fdg70 PipelineProcessWorker JID-18078f7f797961262cb526ea INFO: done: 0.605 sec
2019-12-31T03:22:44.621Z 8 TID-ouo6ncsqk Namespaces::ScheduleAggregationWorker JID-b31fbe5e2e13dc66b0613ba4 INFO: start
2019-12-31T03:22:44.647Z 8 TID-ouo6ncsqk Namespaces::ScheduleAggregationWorker JID-b31fbe5e2e13dc66b0613ba4 INFO: done: 0.026 sec
2019-12-31T03:22:44.653Z 8 TID-ountdduew ExpirePipelineCacheWorker JID-88b0a4c9ce673c406cb953ff INFO: done: 0.315 sec
2019-12-31T03:22:44.764Z 8 TID-grfv42j8k ArchiveTraceWorker JID-8fa2975fa5d6c01a8f016f47 INFO: done: 0.431 sec
2019-12-31T03:22:44.837Z 8 TID-grftc8wwc PipelineNotificationWorker JID-2591ac4029e5e4bc1611e0e3 INFO: done: 0.42 sec
2019-12-31T03:23:36.755Z 8 TID-ounti7pw4 ProjectCacheWorker JID-42e7b36039f0277d536e0a8a INFO: start
Updating statistics for project 4
2019-12-31T03:23:37.588Z 8 TID-ounti7pw4 ProjectCacheWorker JID-42e7b36039f0277d536e0a8a INFO: done: 0.834 sec
2019-12-31T03:23:37.969Z 8 TID-ouo6ncsqk GitGarbageCollectWorker JID-a5fe7bf2cf636d6af84e1700 INFO: start
2019-12-31T03:23:38.274Z 8 TID-ouo6ncsqk GitGarbageCollectWorker JID-a5fe7bf2cf636d6af84e1700 INFO: done: 0.305 sec
2019-12-31T03:25:07.671Z 8 TID-ouog3qd48 PipelineScheduleWorker JID-f2ae9f946db538da575c53dd INFO: start
2019-12-31T03:25:07.683Z 8 TID-ouog3qd48 PipelineScheduleWorker JID-f2ae9f946db538da575c53dd INFO: done: 0.013 sec
2019-12-31T03:26:20.254Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTSTP", :message=>"sending Sidekiq worker PID-8 SIGTSTP (stop fetching new jobs)"}
2019-12-31T03:26:20.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTERM", :message=>"waiting 30 seconds before sending Sidekiq worker PID-8 SIGTERM (gracefully shut down)"}
2019-12-31T03:26:20.308Z 8 TID-grfsd9jtg INFO: Received TSTP, no longer accepting new work
2019-12-31T03:26:20.308Z 8 TID-grfv4eer4 INFO: Scheduler exiting...
2019-12-31T03:26:20.308Z 8 TID-grfsd9jtg INFO: Terminating quiet workers
2019-12-31T03:26:20.309Z 8 TID-grfuao3fs INFO: Scheduler exiting...
2019-12-31T03:26:50.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGTERM", :message=>"sending Sidekiq worker PID-8 SIGTERM (gracefully shut down)"}
2019-12-31T03:26:50.290Z 8 TID-ouognjgqo WARN: {:class=>"PostReceive", :pid=>8, :signal=>"SIGKILL", :message=>"waiting 7 seconds before sending Sidekiq worker PID-8 SIGKILL (die)"}
2019-12-31T03:26:50.291Z 8 TID-grfsd9jtg INFO: Shutting down
2019-12-31T03:26:50.791Z 8 TID-grfsd9jtg INFO: Pausing to allow workers to finish...
2019-12-31T03:26:54.813Z 8 TID-grfsd9jtg WARN: Terminating 7 busy worker threads
2019-12-31T03:26:54.814Z 8 TID-grfsd9jtg WARN: Work still in progress [#<struct Sidekiq::BaseReliableFetch::UnitOfWork queue="queue:update_merge_requests", job="{\"class\":\"UpdateMergeRequestsWorker\",\"args\" <snip>
2019-12-31T03:26:54.825Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 4bb93c97188e306e12cfa969 back to queue queue:update_merge_requests", :jid=>"4bb93c97188e306e12cfa969", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.827Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 85e1d03215b6167c961007b6 back to queue queue:update_merge_requests", :jid=>"85e1d03215b6167c961007b6", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.828Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job f34c07b6bde48d99f129c1db back to queue queue:update_merge_requests", :jid=>"f34c07b6bde48d99f129c1db", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.829Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 5add8856dc4c7f493a4f640c back to queue queue:update_merge_requests", :jid=>"5add8856dc4c7f493a4f640c", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.831Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 693a321ac557b4cef774a380 back to queue queue:update_merge_requests", :jid=>"693a321ac557b4cef774a380", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.832Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job 50e3d07d3fc913a9d8324c6f back to queue queue:update_merge_requests", :jid=>"50e3d07d3fc913a9d8324c6f", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.833Z 8 TID-grfsd9jtg INFO: {:message=>"Pushed job fdc8cc1e067f361652e539e8 back to queue queue:update_merge_requests", :jid=>"fdc8cc1e067f361652e539e8", :queue=>"queue:update_merge_requests"}
2019-12-31T03:26:54.849Z 8 TID-grfsd9jtg INFO: Bye!
The host process
1000 1799 0.0 0.0 4284 0 ? Ss 2019 0:00 | \_ /bin/sh -c /scripts/process-wrapper
1000 1826 0.0 0.0 19784 4 ? S 2019 0:00 | \_ /bin/bash /scripts/process-wrapper
1000 1827 91.6 12.5 7498768 4111964 ? Rl 2019 34299:45 | \_ sidekiq 5.2.7 gitlab [7 of 25 busy] stopping
Strace shows:
strace: Process 1827 attached with 11 threads
[pid 9043] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 4065] futex(0x7f881121a924, FUTEX_WAIT_PRIVATE, 14473590, NULL <unfinished ...>
[pid 4062] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 4061] futex(0x7f87d803dcac, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 4060] futex(0x7f87d8a2df4c, FUTEX_WAIT_PRIVATE, 803, NULL <unfinished ...>
[pid 2311] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 2292] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 2291] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 2290] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 2289] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 1827] read(3, <unfinished ...>
[pid 4065] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 1827] <... read resumed> "\1\0\0\0\0\0\0\0", 8) = 8
[pid 4065] futex(0x7f881121a924, FUTEX_WAIT_PRIVATE, 14473614, NULL <unfinished ...>
[pid 1827] getpid() = 8
[pid 1827] tgkill(8, 19548, SIGVTALRM <unfinished ...>
[pid 9043] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 9043] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 9043] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 9043] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] tgkill(8, 43, SIGVTALRM <unfinished ...>
[pid 9043] poll([{fd=103, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 2311] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 2311] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 2311] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 2311] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] tgkill(8, 24, SIGVTALRM <unfinished ...>
[pid 2311] poll([{fd=106, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 2292] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 2292] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 2292] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 2292] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] tgkill(8, 23, SIGVTALRM <unfinished ...>
[pid 2292] poll([{fd=113, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 2291] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 2291] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 2291] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 2291] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] tgkill(8, 22, SIGVTALRM <unfinished ...>
[pid 2291] poll([{fd=92, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 2290] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 2290] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 2290] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 2290] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] tgkill(8, 21, SIGVTALRM <unfinished ...>
[pid 2290] poll([{fd=98, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 2289] <... restart_syscall resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 2289] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] write(3, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 2289] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... write resumed> ) = 8
[pid 2289] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 1827] timer_settime(0, 0, {it_interval={0, 100000000}, it_value={0, 100000000}}, <unfinished ...>
[pid 2289] poll([{fd=94, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
[pid 1827] <... timer_settime resumed> NULL) = 0
[pid 1827] timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
[pid 1827] sched_yield() = 0
[pid 1827] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1827] getpid() = 8
[pid 1827] tgkill(8, 19548, SIGVTALRM <unfinished ...>
[pid 9043] <... poll resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 1827] <... tgkill resumed> ) = 0
[pid 9043] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=8, si_uid=1000} ---
[pid 1827] getpid( <unfinished ...>
[pid 9043] rt_sigreturn({mask=[]} <unfinished ...>
[pid 1827] <... getpid resumed> ) = 8
[pid 9043] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
Edited by 🤖 GitLab Bot 🤖