fellow softlock with small (16 MB) memory cache, threads waiting for memory in fellow_cache_obj_iter (mostly)

Hi again,

I've upgraded varnish to 7.4.2 and recompiled 1b7eb1c4, now I'm seeing the following:

After generating a few cache entries cache-main softlocks, one core is pegged at 100% and all requests to varnish timeout.

Stopping the service results in this panic:

systemd[1]: Stopping Varnish Cache, a high-performance HTTP accelerator...
varnishd[2317]: Child (2333) died signal=6 (core dumped)
varnishd[2317]: Child (2333) Panic at: Sun, 26 Nov 2023 17:31:22 GMT
                Assert error in LRU_Free(), storage/storage_lru.c line 77:
                  Condition((((&lru->lru_head)->vtqh_first == ((void *)0))) != 0) not true.
                version = varnish-7.4.2 revision cd1d10ab53a6f6115b2b4f3b2a1da94c1f749f80, vrt api = 18.0
                ident = Linux,6.6.2-zen1-1-zen,x86_64,-junix,-sfellow,-sdefault,-Elibvmod_slash.so,-hcritbit,epoll
                now = 27364.918376 (mono), 1701019881.133129 (real)
                Backtrace:
                  0x55645620ca42: /usr/sbin/varnishd(+0x5ba42) [0x55645620ca42]
                  0x55645628be58: /usr/sbin/varnishd(VAS_Fail+0x18) [0x55645628be58]
                  0x556456264fe1: /usr/sbin/varnishd(LRU_Free+0xe1) [0x556456264fe1]
                  0x7efc30c92bd6: vext_cache/libvmod_slash.so,umhegvia.so(+0xebd6) [0x7efc30c92bd6]
                  0x556456262f7c: /usr/sbin/varnishd(STV_close+0x7c) [0x556456262f7c]
                  0x5564562078fd: /usr/sbin/varnishd(child_main+0x22d) [0x5564562078fd]
                  0x556456250565: /usr/sbin/varnishd(+0x9f565) [0x556456250565]
                  0x55645625190b: /usr/sbin/varnishd(MCH_Start_Child+0xb) [0x55645625190b]
                  0x5564561dcbf4: /usr/sbin/varnishd(main+0x2654) [0x5564561dcbf4]
                  0x7efc30f3ed0a: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7efc30f3ed0a]
                  0x5564561dccea: /usr/sbin/varnishd(_start+0x2a) [0x5564561dccea]
                argv = {
                  [0] = \"/usr/sbin/varnishd\",
                  [1] = \"-E\",
                  [2] = \"/usr/lib/varnish/vmods/libvmod_slash.so\",
                  [3] = \"-a\",
                  [4] = \"/run/varnish/varnish.sock,mode=777\",
                  [5] = \"-a\",
                  [6] = \"localhost:8443,PROXY\",
                  [7] = \"-f\",
                  [8] = \"/etc/varnish/default.vcl\",
                  [9] = \"-P\",
                  [10] = \"/run/varnish/varnishd.pid\",
                  [11] = \"-s\",
                  [12] = \"fellow,/srv/fellow,1g,16m,64k\",
                }
                pthread.self = 0x7efc30d2df80
                }
                pthread.self = 0x7efc30d2df80
                pthread.name = (cache-main)
                pthread.attr = {
                  guard = 0,
                  stack_bottom = 0x7ffe57eeb000,
                  stack_top = 0x7ffe586ea000,
                  stack_size = 8384512,
                }
                thr.req = NULL
                thr.busyobj = NULL
                thr.worker = NULL
                vmods = {
                  std = {0x7efc308d7230, Varnish 7.4.2 cd1d10ab53a6f6115b2b4f3b2a1da94c1f749f80, 0.0},
                },
                pools = {
                  pool = 0x7efc29600000 {
                    nidle = 5,
                    nthr = 1685,
                    lqueue = 0
                  },
                  pool = 0x7efc29600640 {
                    nidle = 5,
                    nthr = 1723,
                    lqueue = 0
                  },
                },
varnishd[2317]: Child cleanup complete
varnishd[2317]: manager stopping child
varnishd[2317]: manager dies
systemd[1]: Received SIGCHLD from PID 2317 (varnishd).
systemd[1]: Child 2317 (varnishd) died (code=exited, status=192/n/a)
systemd[1]: varnish.service: Child 2317 belongs to varnish.service.
systemd[1]: varnish.service: Main process exited, code=exited, status=192/n/a
systemd[1]: varnish.service: Failed with result 'exit-code'.

I could not reproduce this with any other storage engine.

Interestingly this only happens 9 out 10 times, one time no timeouts occurred, everything worked as expected.

Here is the objdump.

Let me know if I can provide anything else.