fellow softlock with small (16 MB) memory cache, threads waiting for memory in fellow_cache_obj_iter (mostly)
Hi again,
I've upgraded varnish to 7.4.2 and recompiled 1b7eb1c4, now I'm seeing the following:
After generating a few cache entries cache-main softlocks, one core is pegged at 100% and all requests to varnish timeout.
Stopping the service results in this panic:
systemd[1]: Stopping Varnish Cache, a high-performance HTTP accelerator...
varnishd[2317]: Child (2333) died signal=6 (core dumped)
varnishd[2317]: Child (2333) Panic at: Sun, 26 Nov 2023 17:31:22 GMT
Assert error in LRU_Free(), storage/storage_lru.c line 77:
Condition((((&lru->lru_head)->vtqh_first == ((void *)0))) != 0) not true.
version = varnish-7.4.2 revision cd1d10ab53a6f6115b2b4f3b2a1da94c1f749f80, vrt api = 18.0
ident = Linux,6.6.2-zen1-1-zen,x86_64,-junix,-sfellow,-sdefault,-Elibvmod_slash.so,-hcritbit,epoll
now = 27364.918376 (mono), 1701019881.133129 (real)
Backtrace:
0x55645620ca42: /usr/sbin/varnishd(+0x5ba42) [0x55645620ca42]
0x55645628be58: /usr/sbin/varnishd(VAS_Fail+0x18) [0x55645628be58]
0x556456264fe1: /usr/sbin/varnishd(LRU_Free+0xe1) [0x556456264fe1]
0x7efc30c92bd6: vext_cache/libvmod_slash.so,umhegvia.so(+0xebd6) [0x7efc30c92bd6]
0x556456262f7c: /usr/sbin/varnishd(STV_close+0x7c) [0x556456262f7c]
0x5564562078fd: /usr/sbin/varnishd(child_main+0x22d) [0x5564562078fd]
0x556456250565: /usr/sbin/varnishd(+0x9f565) [0x556456250565]
0x55645625190b: /usr/sbin/varnishd(MCH_Start_Child+0xb) [0x55645625190b]
0x5564561dcbf4: /usr/sbin/varnishd(main+0x2654) [0x5564561dcbf4]
0x7efc30f3ed0a: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7efc30f3ed0a]
0x5564561dccea: /usr/sbin/varnishd(_start+0x2a) [0x5564561dccea]
argv = {
[0] = \"/usr/sbin/varnishd\",
[1] = \"-E\",
[2] = \"/usr/lib/varnish/vmods/libvmod_slash.so\",
[3] = \"-a\",
[4] = \"/run/varnish/varnish.sock,mode=777\",
[5] = \"-a\",
[6] = \"localhost:8443,PROXY\",
[7] = \"-f\",
[8] = \"/etc/varnish/default.vcl\",
[9] = \"-P\",
[10] = \"/run/varnish/varnishd.pid\",
[11] = \"-s\",
[12] = \"fellow,/srv/fellow,1g,16m,64k\",
}
pthread.self = 0x7efc30d2df80
}
pthread.self = 0x7efc30d2df80
pthread.name = (cache-main)
pthread.attr = {
guard = 0,
stack_bottom = 0x7ffe57eeb000,
stack_top = 0x7ffe586ea000,
stack_size = 8384512,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
std = {0x7efc308d7230, Varnish 7.4.2 cd1d10ab53a6f6115b2b4f3b2a1da94c1f749f80, 0.0},
},
pools = {
pool = 0x7efc29600000 {
nidle = 5,
nthr = 1685,
lqueue = 0
},
pool = 0x7efc29600640 {
nidle = 5,
nthr = 1723,
lqueue = 0
},
},
varnishd[2317]: Child cleanup complete
varnishd[2317]: manager stopping child
varnishd[2317]: manager dies
systemd[1]: Received SIGCHLD from PID 2317 (varnishd).
systemd[1]: Child 2317 (varnishd) died (code=exited, status=192/n/a)
systemd[1]: varnish.service: Child 2317 belongs to varnish.service.
systemd[1]: varnish.service: Main process exited, code=exited, status=192/n/a
systemd[1]: varnish.service: Failed with result 'exit-code'.
I could not reproduce this with any other storage engine.
Interestingly this only happens 9 out 10 times, one time no timeouts occurred, everything worked as expected.
Here is the objdump.
Let me know if I can provide anything else.