CPU hotplug crashes the guest when using virt-type qemu
Host environment
- Operating system: Fedora41
- OS/kernel version: 6.11.7-300.fc41.ppc64le
- Architecture: ppc64
- QEMU flavor: qemu-system-ppc64
- QEMU version: QEMU emulator version 9.1.1 (qemu-9.1.1-1.fc41)
libvirt xml snippet:
<domain type='qemu'>
<name>linux</name>
<uuid>cba9037f-2a62-41f9-98c1-0780b2ff49b9</uuid>
<maxMemory slots='16' unit='KiB'>419430400</maxMemory>
<memory unit='KiB'>20971520</memory>
<currentMemory unit='KiB'>10485760</currentMemory>
<memoryBacking>
<locked/>
</memoryBacking>
<vcpu placement='static' current='4'>1024</vcpu>
Emulated/Virtualized environment
- Operating system: Fedora41
- OS/kernel version: 6.11.7-300.fc41.ppc64le
- Architecture: ppc64
Description of problem
The guest crashes and goes into the shutoff state when I hotplug far more CPUs than are present on the host. This happens only with virt-type qemu.
Steps to reproduce:
- Start a guest with virt-type as qemu
<domain type='qemu'>
<name>linux</name>
<uuid>cba9037f-2a62-41f9-98c1-0780b2ff49b9</uuid>
<maxMemory slots='16' unit='KiB'>419430400</maxMemory>
<memory unit='KiB'>20971520</memory>
<currentMemory unit='KiB'>10485760</currentMemory>
<memoryBacking>
<locked/>
</memoryBacking>
<vcpu placement='static' current='4'>1024</vcpu>
- lscpu on host:
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 5
Socket(s): 1
Physical sockets: 4
Physical chips: 1
Physical cores/chip: 12
- [On host] virsh setvcpus linux 800
error: Unable to read from monitor: Connection reset by peer
- Guest is getting into shutoff state
- Issue is seen with upstream qemu also.
Additional information
Reproducing with gdb attached gives the backtrace below. The abort happened after hotplugging 249 CPUs in TCG mode:
Thread 261 "CPU 249/TCG" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ff97c00ea20 (LWP 51567)]
0x00007fff84cac3e8 in __pthread_kill_implementation () from target:/lib64/glibc-hwcaps/power10/libc.so.6
(gdb) bt
#0 0x00007fff84cac3e8 in __pthread_kill_implementation () from target:/lib64/glibc-hwcaps/power10/libc.so.6
#1 0x00007fff84c46ba0 in raise () from target:/lib64/glibc-hwcaps/power10/libc.so.6
#2 0x00007fff84c29354 in abort () from target:/lib64/glibc-hwcaps/power10/libc.so.6
#3 0x00007fff850f1e30 in g_assertion_message () from target:/lib64/libglib-2.0.so.0
#4 0x00007fff850f1ebc in g_assertion_message_expr () from target:/lib64/libglib-2.0.so.0
#5 0x00000001376c6f00 in tcg_region_initial_alloc__locked (s=0x7fff7c000f30) at ../tcg/region.c:396
#6 0x00000001376c6fa8 in tcg_region_initial_alloc (s=0x7fff7c000f30) at ../tcg/region.c:402
#7 0x00000001376dae08 in tcg_register_thread () at ../tcg/tcg.c:1011
#8 0x000000013768b7e4 in mttcg_cpu_thread_fn (arg=0x143e884f0) at ../accel/tcg/tcg-accel-ops-mttcg.c:77
#9 0x0000000137bbb2d0 in qemu_thread_start (args=0x143b4aff0) at ../util/qemu-thread-posix.c:542
#10 0x00007fff84ca9be0 in start_thread () from target:/lib64/glibc-hwcaps/power10/libc.so.6
#11 0x00007fff84d4ef3c in __clone3 () from target:/lib64/glibc-hwcaps/power10/libc.so.6
(gdb)
This points to the following code in tcg/region.c:
/*
 * Perform a context's first region allocation.
 * This function does _not_ increment region.agg_size_full.
 */
static void tcg_region_initial_alloc__locked(TCGContext *s)
{
    bool err = tcg_region_alloc__locked(s);
    g_assert(!err);
}
Here, tcg_region_alloc__locked returns true on failure, i.e. when every region has already been allocated (region.current == region.n). The result is asserted intentionally, since TCG cannot proceed without a region for the new context.
static bool tcg_region_alloc__locked(TCGContext *s)
{
    if (region.current == region.n) {
        return true;
    }
    tcg_region_assign(s, region.current);
    region.current++;
    return false;
}
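To illustrate the failure mode outside of QEMU, here is a minimal standalone sketch of the same bookkeeping. It is not QEMU code: struct region_state, region_alloc and register_threads are hypothetical names, and the pool size is whatever the caller chooses, not the value QEMU actually computes. It only shows that once the fixed pool of regions is used up, the next thread that registers cannot get one, which is the point where QEMU's g_assert(!err) fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's global region bookkeeping. */
struct region_state {
    size_t current; /* index of the next unassigned region */
    size_t n;       /* total number of regions, fixed at startup */
};

/* Same convention as tcg_region_alloc__locked(): true means failure. */
static bool region_alloc(struct region_state *r)
{
    if (r->current == r->n) {
        return true; /* every region is already owned by a thread */
    }
    r->current++;
    return false;
}

/*
 * Register `threads` vCPU threads one at a time and return how many
 * of them obtained a region before the pool ran out.
 */
static size_t register_threads(struct region_state *r, size_t threads)
{
    size_t ok = 0;

    for (size_t i = 0; i < threads; i++) {
        if (region_alloc(r)) {
            break; /* in QEMU this is where g_assert(!err) aborts */
        }
        ok++;
        assert(r->current <= r->n); /* the pool is never over-committed */
    }
    return ok;
}
```

With a pool of 4 regions, for example, registering 3 threads succeeds for all 3, and a further request for 5 threads succeeds for only 1 before the pool is exhausted. In the report above the same thing appears to happen in the thread for CPU 249, but the real region count in that run is not shown in the backtrace.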
Edited by Anushree Mathur