
Thread deadlocks

The sequence of events is complicated, so I will explain it as a timeline.

First, a panic occurred on 5/1. I had never seen anything like it. It was storage-related, but there was no slash frame in the backtrace, so I just observed the situation for a while.

panic: https://gitlab.com/-/snippets/2537518

Then, when the number of objects reached its original level, threads suddenly started locking up, and eventually the entire thread pool was consumed. In this state, the child still responded to pings from the parent process but could no longer accept new requests.
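
In hindsight, this state should be visible in the worker-thread counters; a check along these lines (illustrative, not a command I recorded at the time) would show MAIN.threads pinned at the pool maximum while the session queue grows:

```
varnishstat -1 -f MAIN.threads -f MAIN.threads_limited -f MAIN.thread_queue_len
```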


At first I was not sure where the threads were locking up, so I took a thread dump and found many threads waiting in ObjWaitState(BOS_STREAM).
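
For reference, a dump like the one below can be taken with gdb along these lines (pid placeholder; not my exact invocation):

```
gdb -p <varnishd-child-pid> -batch -ex 'thread apply all bt' > threads.txt
```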

### h2

```
Thread 1940 (Thread 0x7f7be9bbe640 (LWP 944075) "cache-worker"):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f7bf24a6dfc) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7f7bf24a6dfc) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f7bf24a6dfc, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f7c4b202ac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f7bf24789d8, cond=0x7f7bf24a6dd0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x7f7bf24a6dd0, mutex=mutex@entry=0x7f7bf24789d8) at ./nptl/pthread_cond_wait.c:627
#5  0x00005641f24f8e94 in Lck_CondWaitUntil (cond=0x7f7bf24a6dd0, lck=<optimized out>, when=when@entry=0) at cache/cache_lck.c:238
#6  0x00005641f24f902d in Lck_CondWait (cond=<optimized out>, lck=<optimized out>) at cache/cache_lck.c:211
#7  0x00005641f24fc0a5 in ObjWaitState (oc=0x7f7bf24ade00, want=want@entry=BOS_STREAM) at cache/cache_obj.c:339
#8  0x00005641f24ec559 in VBF_Fetch (wrk=wrk@entry=0x7f7be9bbc4f0, req=req@entry=0x7f7bf2a247e0, oc=<optimized out>, oldoc=0x0, mode=mode@entry=VBF_NORMAL) at cache/cache_fetch.c:1200
#9  0x00005641f25037d0 in cnt_miss (wrk=0x7f7be9bbc4f0, req=0x7f7bf2a247e0) at cache/cache_req_fsm.c:685
#10 0x00005641f25073e0 in CNT_Request (req=req@entry=0x7f7bf2a247e0) at cache/cache_req_fsm.c:1175
#11 0x00005641f253996b in h2_do_req (wrk=0x7f7be9bbc4f0, priv=0x7f7bf2a247e0) at http2/cache_http2_proto.c:543
#12 0x00005641f2526a12 in Pool_Work_Thread (wrk=0x7f7be9bbc4f0, pp=0x7f7c43800000) at cache/cache_wrk.c:491
#13 WRK_Thread (qp=0x7f7c43800000, stacksize=<optimized out>, thread_workspace=<optimized out>) at cache/cache_wrk.c:157
#14 0x00005641f2526fd8 in pool_thread (priv=0x7f7bf8d98760) at cache/cache_wrk.c:524
#15 0x00007f7c4b203b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007f7c4b295a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```


### h1

```
Thread 1935 (Thread 0x7f7be9c27640 (LWP 944070) "cache-worker"):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f7bea38193c) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7f7bea38193c) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f7bea38193c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f7c4b202ac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f7bea393ef8, cond=0x7f7bea381910) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x7f7bea381910, mutex=mutex@entry=0x7f7bea393ef8) at ./nptl/pthread_cond_wait.c:627
#5  0x00005641f24f8e94 in Lck_CondWaitUntil (cond=0x7f7bea381910, lck=<optimized out>, when=when@entry=0) at cache/cache_lck.c:238
#6  0x00005641f24f902d in Lck_CondWait (cond=<optimized out>, lck=<optimized out>) at cache/cache_lck.c:211
#7  0x00005641f24fc0a5 in ObjWaitState (oc=0x7f7bea38f180, want=want@entry=BOS_STREAM) at cache/cache_obj.c:339
#8  0x00005641f24ec559 in VBF_Fetch (wrk=wrk@entry=0x7f7be9c254f0, req=req@entry=0x7f7bf23b8620, oc=<optimized out>, oldoc=0x0, mode=mode@entry=VBF_NORMAL) at cache/cache_fetch.c:1200
#9  0x00005641f25037d0 in cnt_miss (wrk=0x7f7be9c254f0, req=0x7f7bf23b8620) at cache/cache_req_fsm.c:685
#10 0x00005641f25073e0 in CNT_Request (req=req@entry=0x7f7bf23b8620) at cache/cache_req_fsm.c:1175
#11 0x00005641f25322c2 in HTTP1_Session (req=0x7f7bf23b8620, wrk=0x7f7be9c254f0) at http1/cache_http1_fsm.c:392
#12 http1_req (wrk=0x7f7be9c254f0, arg=0x7f7bf23b8620) at http1/cache_http1_fsm.c:87
#13 0x00005641f2526a12 in Pool_Work_Thread (wrk=0x7f7be9c254f0, pp=0x7f7c43800000) at cache/cache_wrk.c:491
#14 WRK_Thread (qp=0x7f7c43800000, stacksize=<optimized out>, thread_workspace=<optimized out>) at cache/cache_wrk.c:157
#15 0x00005641f2526fd8 in pool_thread (priv=0x7f7bf8d986c0) at cache/cache_wrk.c:524
#16 0x00007f7c4b203b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#17 0x00007f7c4b295a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```
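
For context, frames #5 to #7 are an untimed condition-variable wait: as I read cache/cache_obj.c, ObjWaitState() holds the busy object's lock and waits for its state to reach BOS_STREAM with no deadline (abstime=0x0 in the dump). A standalone sketch of that pattern (names simplified by me, not the verbatim Varnish source):

```c
#include <pthread.h>

/* Simplified stand-in for Varnish's busy-object state machine;
 * names are approximate, not the verbatim source. */
enum boc_state { BOS_INVALID, BOS_REQ_DONE, BOS_STREAM, BOS_FINISHED };

struct busyobj_state {
	pthread_mutex_t mtx;
	pthread_cond_t  cond;
	enum boc_state  state;
};

/* The pattern behind frames #5-#7: wait without a timeout until the
 * fetch side advances the state. If the fetch task never signals
 * BOS_STREAM, the client thread blocks forever, which is what all of
 * these workers appear to be doing. */
void
wait_for_state(struct busyobj_state *bo, enum boc_state want)
{
	pthread_mutex_lock(&bo->mtx);
	while (bo->state < want)
		pthread_cond_wait(&bo->cond, &bo->mtx);
	pthread_mutex_unlock(&bo->mtx);
}
```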

The full dump is here: https://gitlab.com/-/snippets/2537515

Stats: https://gitlab.com/-/snippets/2537541

Unfortunately, the child was killed by the parent while the dump was being taken, and it was restarted. All stored objects seem to have been cleared, but it is working fine now.

I cannot be sure whether this is related to slash, but I believe it probably is, since it only occurred on the node verifying slash. If it turns out to be unrelated, I apologize.

Please let us know if there is any other information you need.

Environment

| Component | Version |
| --- | --- |
| OS | Ubuntu 22.04 (5.15.0-52-generic) |
| varnishd | 7.3.0 (84d79120b6d17b11819a663a93160743f293e63f) |
| slash | 49bbf2c63370a09104d7603a1776e689e7bdaa93 |
| libxxhash0:amd64 | 0.8.1-1 |
| liburing2:amd64 | 2.1-2build1 |

This proxy mainly handles static resources; although pesi and memory=buddy,100M are loaded, it does not normally handle traffic that uses them.
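
For clarity on the configuration: the buddy storage is given on the varnishd command line and pesi is imported in the VCL, schematically like this (illustrative, not the full argument list of this node's service):

```
# illustrative, not the actual service arguments
varnishd -a :6081 -f /etc/varnish/default.vcl -s memory=buddy,100M
# ...and in the VCL: import pesi;
```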

I forgot to set the memlock ulimit to unlimited in the settings; it is still the Varnish default (LimitMEMLOCK=100M).
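
The fix on my side would be a systemd drop-in along these lines (path illustrative):

```
# /etc/systemd/system/varnish.service.d/limits.conf  (illustrative path)
[Service]
LimitMEMLOCK=infinity
```

followed by a systemctl daemon-reload and a restart of the service.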
