Crash on fast libvirtd restart
Software environment
- Operating system: Ubuntu 20.04
- Architecture: x86_64
- Kernel version: 5.4.0-84-generic
- libvirt version: 7.0.0
- Hypervisor and version: qemu 5.2
Description of problem
TL;DR:
- segfault on service restart
systemd[1]: Stopping Virtualization daemon...
kernel: libvirtd[6040]: segfault at 50 ip 00007fa0c3437fc4 sp 00007ffc1df32a18 error 4 in libpthread-2.31.so[7fa0c3433000+11000]
kernel: Code: 7e 8f 45 31 d2 ba 01 00 00 00 be 01 00 00 00 48 89 ef b8 ca 00 00 00 0f 05 e9 73 ff ff ff e8 13 b7 ff ff 0f 1f 00 f3 0f 1e fa <
Background: I happened to see a segfault while package upgrades were running and wanted to know what was going on. I found plenty of interesting details, but no clear fix yet. The problem is racy by nature and I only see it with the slightly older v7.0.0, yet I found no related fix (that I'd recognize) in git, so it may just be that the race window changed slightly. Therefore I'm reporting the issue so that more people can have a look - thanks in advance.
Steps to reproduce
Initially I thought this was super weird, only hitting at certain combinations of packages that needed to be installed. But eventually I found that all it takes is a sufficiently fast restart. So I dropped all my odd details and can say it reproduces with:
$ while /bin/true; do sudo systemctl restart libvirtd; done
Additional information
In gdb it looks like this:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___pthread_mutex_lock (mutex=0x40) at ../nptl/pthread_mutex_lock.c:67
67 ../nptl/pthread_mutex_lock.c: No such file or directory.
The trace with detail is like:
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x40) at ../nptl/pthread_mutex_lock.c:67
#1 0x00007f44e0ceb91a in virThreadPoolStop (pool=0x0) at ../../src/util/virthreadpool.c:496
#2 0x00007f44d81f4de8 in qemuStateShutdownPrepare () at ../../src/qemu/qemu_driver.c:1075
#3 0x00007f44e0e9fc80 in virStateShutdownPrepare () at ../../src/libvirt.c:690
#4 0x00007f44e0dab74d in virNetDaemonRun (dmn=0x5641d4166830) at ../../src/rpc/virnetdaemon.c:869
#5 0x00005641d24a1767 in main (argc=<optimized out>, argv=<optimized out>) at ../../src/remote/remote_daemon.c:1209
On recreating the issue, the signature above looks the same apart from the effects of address randomization.
Frame 1 that does the bad call is:
void
virThreadPoolStop(virThreadPoolPtr pool)
{
    virMutexLock(&pool->mutex);
    virThreadPoolStopLocked(pool);
    virMutexUnlock(&pool->mutex);
}
Since the backtrace shows it is called with pool=0x0, the crash can only come from dereferencing that NULL pointer. With pool == NULL, &pool->mutex evaluates to the offset of the mutex member within the struct - which is exactly the mutex=0x40 seen in frame #0 - and locking that address segfaults.
This call comes from
static int
qemuStateShutdownPrepare(void)
{
    virThreadPoolStop(qemu_driver->workerPool);
    return 0;
}
And that means this is likely a race that eventually passes 0x0 as workerPool.
In the past there was [1], which seems similar, but that fix is old and already applied. We might have a new race here, but I can't see exactly why, nor how the odd combination of installed packages would be needed to trigger it. Maybe the cleanup in [2] accidentally forgot/ignored the reason [1] was added in the first place, and I have now just happened to find a way to trigger the known race that was re-opened by [2]?
Is it fixed already?
Since I cannot recreate this on v7.6.0 and later, it might be fixed already. Maybe the fix was in a place where I did not recognize that the commit was related? In that case I'd appreciate a pointer to the fix so I can give it a try. On the other hand, the problem is racy by nature and might just be avoided by accident in the newer version, so it may still be worth a look/fix.