This is showing all the same hallmarks as production#1309 (closed), where one sidekiq process is stuck in some strange state:
- A single pegged CPU
- strace complaining about x32 mode
- and so on.
IMO this now warrants closer inspection, as it is:
- No longer a single, random occurrence
- On the same type of sidekiq node as last time
Just the CPU graph for asap-02 suggests that it started back on Nov 21. Drilling in suggests perhaps around 16:00 UTC, which coincides with a release being installed (this was lightly implicated in the previous incident).
I'm going to leave that process running for now, in case there's a chance it can provide us with information, although I'll probably kill it in the morning (just call me the Dread Pirate Miskell)
Threads 16 and 22 are the only ones not in pthread_cond_wait (someone else please verify that for me? I'd hate to mislead).
Of those two:
- 22 is in an epoll_wait inside the GRPC gem; talking to gitaly, I'm guessing?
- 16 is in a postgres connection, but ultimately inside ubf_timer_disarm, which some cursory research (https://github.com/ruby/ruby/blob/master/thread.c#L4363) suggests may have threading implications (that's not the call site in question, but clearly it's signalling/alarms/thread related).
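As an aside, for anyone who wants the Ruby-level view to go with the native gdb trace: a minimal sketch of a signal handler that dumps every thread's Ruby backtrace. Sidekiq ships something comparable on TTIN; this standalone version is just illustrative.

```ruby
# Illustrative only: dump every Ruby thread's backtrace when the process
# receives TTIN, so the Ruby-level state can be compared with the native
# (gdb/pthread) view. Sidekiq has a similar built-in TTIN handler.
Signal.trap('TTIN') do
  Thread.list.each do |thread|
    warn "=== Thread #{thread.object_id} #{thread.inspect}"
    warn((thread.backtrace || ['<no Ruby backtrace available>']).join("\n"))
  end
end
```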
The kernel thinks that pid 20847 still has a handful of open connections to postgres (10.217.4.5, the read/write pgbouncer load balancer); there's no sign of any other TCP connections from that pid.
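For reference, this is roughly the kind of cross-check that produces that answer; a sketch that reads /proc directly (the same information `ss -tnp` or `lsof -p` would show; it needs privileges to read another user's fds, and only covers IPv4 TCP):

```ruby
# Sketch: list the remote TCP peers the kernel has on record for a pid,
# by mapping its fd symlinks to socket inodes and looking those up in
# /proc/net/tcp. IPv4 only; run with sufficient privileges.
pid = 20_847 # the stuck worker from above

socket_inodes = Dir.glob("/proc/#{pid}/fd/*").filter_map do |fd|
  File.readlink(fd)[/\Asocket:\[(\d+)\]\z/, 1]
rescue Errno::ENOENT
  nil # fd closed while we were looking
end

File.readlines('/proc/net/tcp').drop(1).each do |line|
  fields = line.split
  next unless socket_inodes.include?(fields[9]) # field 9 is the inode

  ip_hex, port_hex = fields[2].split(':') # remote address, little-endian hex
  remote_ip = ip_hex.scan(/../).reverse.map(&:hex).join('.')
  puts "remote peer: #{remote_ip}:#{port_hex.hex}"
end
```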
Finishing up for today; I welcome any other input, but if there's nothing, I'll pick things up tomorrow and see if I can extract more info from the stack trace and some research.
Looks like a GVL deadlock? It's waiting on the condition variable for the global VM lock to be released, but it probably never is. I wonder if this is caused by a native extension that performs its own locking and unlocking. In that case the problem would be found in whichever thread is currently holding the lock, not the one the stack trace was pulled from (that's just the symptom).
Can we pull thread dumps for all threads in the worker? Talk about finding the needle in a haystack ;-)
@mkaeppler That dump is for all threads on the problem worker; have I misunderstood your meaning?
Also, I realized that I can tell (from top -H) which LWP is chewing CPU: it's 23918, and that thread is 16 (the postgres connection thread).
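(Incidentally, on Ruby 3.1+ that LWP-to-thread mapping can be done from inside the process, since Thread#native_thread_id returns the same TID that top -H displays; these workers are almost certainly on an older Ruby, so this is just for future reference.)

```ruby
# Ruby 3.1+ only: print each Ruby thread alongside its native thread id
# (the LWP shown by `top -H`), to match a busy LWP to a Ruby thread.
Thread.list.each do |thread|
  puts format('LWP %-8s %s', thread.native_thread_id, thread.inspect)
end
```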
Also, I don't know how I missed it, but there are actually two stuck sidekiq workers, not just the one from Nov 21; the other is from Oct 30th. I think that's actually from production#1309 (closed) (ID 12729), which didn't go away as I stated. This is great, because now we have two to compare.
Looking again at LWP 23918 in the newer one, it's still in pgconn_async_exec at what looks to be the same basic place, just one call higher than last time (sampling error). Feels like it's in a tight loop trying to read from a postgres connection that is not (ever) responding.
Looking at LWP 12729 in the older trace, it's somewhere completely different, trying to parse a ruby string/brace expansion from File#fnmatch (perhaps some sort of pathological and long globbing pattern?).
> That dump is for all threads on the problem worker; have I misunderstood your meaning?
Oops, yep, sorry; I only skimmed it yesterday and didn't see that it's a single file for all threads. Yes, the link you posted seems to match the thread dump.
Here is what I read into it:
- All but 2 thread groups (gitaly/grpc and a postgres connection thread) are waiting on the GVL
- Gitaly is running in hands-free mode (no GVL), so this is not the reason other threads are blocked
- Thread-16 (postgres) appears to be the only non-gitaly thread that is not waiting on a condition variable but is waiting for IO instead (wait_socket_readable); I think the function on top of the stack is part of MRI's timer logic that assigns time slices to Ruby threads and gives them an opportunity to release the GVL. Perhaps ubf_timer_disarm breaks, thus all threads remain waiting either on a condition variable or IO?
- What confuses me is that no thread appears to actually be holding the GVL?
Your analysis seems right to me; this situation makes no sense: nothing is holding the GVL, and yet everything is waiting on it. It's hard to know whether Thread-16 is stuck in that do-loop because of some other problem, or whether its being stuck is the cause of everything else waiting on the thread condition.
I don't think we're going to get any more information out of this situation (the stack traces speak for themselves). I have killed off both of these processes. If/when this recurs, it will be interesting to check the new stack traces, and see if there's any correlation.
On the general situation: perhaps our HUP/restart of sidekiq-cluster needs to be more vigorous about killing off older processes if they don't terminate cleanly when asked. I'll see if I can get an MR together.
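For the record, a minimal sketch of the escalation I have in mind, assuming a hypothetical helper name and an arbitrary grace period (the real sidekiq-cluster change will look different): ask with TERM, and fall back to KILL if the worker is still around after the grace period.

```ruby
# Hypothetical escalation sketch, not the actual sidekiq-cluster code:
# send TERM, give the process a grace period to exit cleanly, then KILL.
GRACE_PERIOD = 30 # seconds, arbitrary for illustration

def shutdown_stale_worker(pid)
  Process.kill('TERM', pid)

  deadline = Time.now + GRACE_PERIOD
  while Time.now < deadline
    Process.kill(0, pid) # raises Errno::ESRCH once the process is gone
    sleep 1
  end

  # Still alive after the grace period: stop asking nicely.
  Process.kill('KILL', pid)
rescue Errno::ESRCH
  # The process exited on its own; nothing more to do.
  # (For direct children you'd also reap with Process.waitpid to avoid zombies.)
end
```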