Crash on fast libvirtd restart
Software environment
- Operating system: Ubuntu 20.04
- Architecture: x86_64
- Kernel version: 5.4.0-84-generic
- libvirt version: 7.0.0
- Hypervisor and version: qemu 5.2
Description of problem
TL;DR:
- segfault on service restart
systemd[1]: Stopping Virtualization daemon...
kernel: libvirtd[6040]: segfault at 50 ip 00007fa0c3437fc4 sp 00007ffc1df32a18 error 4 in libpthread-2.31.so[7fa0c3433000+11000]
kernel: Code: 7e 8f 45 31 d2 ba 01 00 00 00 be 01 00 00 00 48 89 ef b8 ca 00 00 00 0f 05 e9 73 ff ff ff e8 13 b7 ff ff 0f 1f 00 f3 0f 1e fa <
Background: I happened to see a segfault while package upgrades were running and wanted to know what was going on. I found plenty of interesting details, but no clear fix yet. The problem is racy by nature and I only see it with the slightly older v7.0.0, yet I found no related fix (that I'd recognize) in git, so it may just be that the race window changed slightly. Therefore I'm reporting the issue so that more people can have a look - thanks in advance.
Steps to reproduce
Initially I thought this was super weird, only hitting at certain combinations of packages that needed to be installed. But eventually I found that all it takes is a sufficiently fast restart. So I dropped all my odd details and can say it reproduces with:
$ while /bin/true; do sudo systemctl restart libvirtd; done
Additional information
In gdb it looks like this:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___pthread_mutex_lock (mutex=0x40) at ../nptl/pthread_mutex_lock.c:67
67 ../nptl/pthread_mutex_lock.c: No such file or directory.
The trace with detail is like:
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x40) at ../nptl/pthread_mutex_lock.c:67
#1 0x00007f44e0ceb91a in virThreadPoolStop (pool=0x0) at ../../src/util/virthreadpool.c:496
#2 0x00007f44d81f4de8 in qemuStateShutdownPrepare () at ../../src/qemu/qemu_driver.c:1075
#3 0x00007f44e0e9fc80 in virStateShutdownPrepare () at ../../src/libvirt.c:690
#4 0x00007f44e0dab74d in virNetDaemonRun (dmn=0x5641d4166830) at ../../src/rpc/virnetdaemon.c:869
#5 0x00005641d24a1767 in main (argc=<optimized out>, argv=<optimized out>) at ../../src/remote/remote_daemon.c:1209
On recreating the issue, the signature above looks the same apart from the effects of address randomization.
Frame 1 that does the bad call is:
void
virThreadPoolStop(virThreadPoolPtr pool)
{
    virMutexLock(&pool->mutex);
    virThreadPoolStopLocked(pool);
    virMutexUnlock(&pool->mutex);
}
Since the backtrace shows it is called with pool=0x0, the crash can only come from dereferencing that NULL pointer. With pool == NULL, &pool->mutex evaluates to the offset of the mutex member within the struct - which is exactly the mutex=0x40 seen in frame #0 - and locking that address segfaults.
This call comes from
static int
qemuStateShutdownPrepare(void)
{
    virThreadPoolStop(qemu_driver->workerPool);
    return 0;
}
And that means this is likely a race that eventually passes 0x0 as workerPool.
In the past there was [1], which seems similar, but that fix is old and already applied. We might have a new race here, but I can't see exactly why, nor how the odd combination of installed packages would be needed to trigger it. Maybe the cleanup in [2] accidentally forgot/ignored the reason [1] was added in the first place, and I have now just happened to find a way to trigger the known race that was re-opened by [2]?
Is it fixed already?
Since I cannot recreate this on v7.6.0 and later, it might be fixed already. Maybe the fix was in a place where I did not recognize that the commit was related? In that case I'd appreciate a pointer to the fix so I can give it a try. On the other hand, the problem is racy by nature and might just be avoided by accident in the newer version, so it may still be worth a look/fix.