AArch32 LDRD alignment requirements are different under multi-cpu TCG vs KVM
Host environment
- Operating system: Arch Linux
- OS/kernel version: Linux cub3d-arch-desktop 6.17.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 06 Oct 2025 18:48:29 +0000 x86_64 GNU/Linux
- Architecture: x86_64
- QEMU flavor: qemu-system-aarch64
- QEMU version: QEMU emulator version 10.1.50
- QEMU command line:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 2
Emulated/Virtualized environment
- Operating system: Custom
- OS/kernel version: N/A
- Architecture: AArch64 EL1, AArch32 EL0
Description of problem
The LDRD instruction behaves differently under TCG and KVM in some situations.
Running the following on the x86 host above:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 1
produces this output over serial (SVC):
panic: Unexpected sync_lower, esr = 0x44000000
Running the same command with -smp 2:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 2
produces this output (Alignment fault):
panic: Unexpected sync_lower, esr = 0x92000021
By contrast, running either command under KVM (tested on a Raspberry Pi 4B, BCM2711 / Cortex-A72) produces this output (SVC):
panic: Unexpected sync_lower, esr = 0x44000000
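For context, the two ESR values decode as follows. This is a minimal standalone sketch using the architectural ESR_EL1 field layout (EC in bits [31:26], ISS in bits [24:0]); it is not part of the PoC:

    #include <stdio.h>

    /* Standalone helper: split an ESR_EL1 value into its EC and ISS fields. */
    static void decode_esr(unsigned long esr)
    {
        unsigned ec  = (esr >> 26) & 0x3f;   /* exception class               */
        unsigned iss = esr & 0x1ffffff;      /* instruction-specific syndrome */
        printf("esr=0x%08lx ec=0x%02x iss=0x%x\n", esr, ec, iss);
    }

    int main(void)
    {
        decode_esr(0x44000000); /* EC 0x11: SVC taken from AArch32 (expected)    */
        decode_esr(0x92000021); /* EC 0x24: data abort from a lower EL, DFSC 0x21
                                 * = alignment fault (the -smp 2 TCG result)     */
        return 0;
    }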
This example is:
- an AArch64 EL1 kernel
- with an AArch32 EL0
- that executes an LDRD from Device memory at a 32-bit-aligned address, then issues an SVC:
# x4 = 0x80080800
ldrd r5, r2, [r4, #4]
svc 0
a: b a /* hang */
In the multi-cpu TCG case only, the LDRD produces an alignment fault.
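For reference, the effective address of that LDRD is 0x80080800 + 4 = 0x80080804, which satisfies 4-byte alignment but not 8-byte alignment; a trivial standalone check (not part of the PoC):

    #include <stdio.h>

    int main(void)
    {
        unsigned int base = 0x80080800;   /* base register value in the PoC      */
        unsigned int addr = base + 4;     /* effective address of the LDRD/STRD  */

        /* 4-byte aligned (so a 4-byte alignment requirement is satisfied), but
         * not 8-byte aligned, so the 8-byte access cannot be performed as one
         * naturally aligned unit at this address. */
        printf("addr=0x%08x 4-byte aligned: %d 8-byte aligned: %d\n",
               addr, (addr & 3) == 0, (addr & 7) == 0);
        return 0;
    }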
Under TCG, in do_ldrd_load, the alignment and atomicity requirements of the LDRD seem to come from here:
MemOp opc = MO_64 | MO_ALIGN_4 | MO_ATOM_SUBALIGN | s->be_data;
However, because of the following in tcg_canonicalize_memop, the atomicity requirement is dropped when running on a single CPU (CF_PARALLEL is set if maxcpus > 1), which explains why this only happens in the -smp 2 case:
/* In serial mode, reduce atomicity. */
if (!(tcg_ctx->gen_tb->cflags & CF_PARALLEL)) {
    op &= ~MO_ATOM_MASK;
    op |= MO_ATOM_NONE;
}
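To make the effect concrete, here is a small standalone sketch of that reduction applied to the MemOp built by do_ldrd_load. The enum values are illustrative stand-ins, not QEMU's real encodings (those live in include/exec/memop.h); only the masking logic mirrors the snippet above:

    #include <stdio.h>

    /* Illustrative stand-in values for the MemOp bits involved -- these are
     * NOT QEMU's real encodings, only placeholders for the masking logic. */
    enum {
        MO_64            = 0x3,
        MO_ALIGN_4       = 0x2 << 4,
        MO_ATOM_SUBALIGN = 0x4 << 8,  /* atomic in parts, per the alignment */
        MO_ATOM_NONE     = 0x5 << 8,  /* no atomicity requirement           */
        MO_ATOM_MASK     = 0x7 << 8,
    };

    /* Mirror of the reduction quoted above from tcg_canonicalize_memop. */
    static int canonicalize(int op, int cf_parallel)
    {
        if (!cf_parallel) {           /* serial mode (-smp 1) */
            op &= ~MO_ATOM_MASK;
            op |= MO_ATOM_NONE;
        }
        return op;
    }

    int main(void)
    {
        int op = MO_64 | MO_ALIGN_4 | MO_ATOM_SUBALIGN;  /* as in do_ldrd_load */

        /* -smp 1: SUBALIGN is replaced by NONE, so no extra atomicity is
         * enforced and the 4-byte-aligned access goes through. */
        printf("serial:   atom bits = 0x%x\n", canonicalize(op, 0) & MO_ATOM_MASK);

        /* -smp 2: SUBALIGN survives canonicalization; enforcing it is where
         * the alignment fault shows up in this report. */
        printf("parallel: atom bits = 0x%x\n", canonicalize(op, 1) & MO_ATOM_MASK);
        return 0;
    }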
I'm not familiar enough with TCG internals to know what the correct fix would be, but it seems that the atomicity requirements of this instruction are wrong when compared to real hardware.
Additional information
This issue also seems to affect the STRD instruction.