AArch32 LDRD alignment requirements are different under multi-cpu TCG vs KVM
Host environment
- Operating system: Arch Linux
- OS/kernel version: Linux cub3d-arch-desktop 6.17.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 06 Oct 2025 18:48:29 +0000 x86_64 GNU/Linux
- Architecture: x86_64
- QEMU flavor: qemu-system-aarch64
- QEMU version: QEMU emulator version 10.1.50
- QEMU command line:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 2
Emulated/Virtualized environment
- Operating system: Custom
- OS/kernel version: N/A
- Architecture: AArch64 EL1, AArch32 EL0
Description of problem
The LDRD instruction behaves differently under TCG and KVM in some situations.
Running the following on the x86 host above:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 1
produces this output over serial (SVC):
panic: Unexpected sync_lower, esr = 0x44000000
Running the same command with -smp 2:
qemu-system-aarch64 -kernel poc --machine virt --cpu cortex-a72 -serial stdio -m 2G -smp 2
produces this output (Alignment fault):
panic: Unexpected sync_lower, esr = 0x92000021
By contrast, running either command under KVM (tested on a Raspberry Pi 4B, BCM2711 / Cortex-A72) produces this output (SVC):
panic: Unexpected sync_lower, esr = 0x44000000
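For context, the two ESR values decode as follows. This is a minimal standalone sketch using the architectural ESR_EL1 field layout (EC in bits [31:26], ISS in bits [24:0]); it is not part of the PoC:

    #include <stdio.h>

    /* Standalone helper: split an ESR_EL1 value into its EC and ISS fields. */
    static void decode_esr(unsigned long esr)
    {
        unsigned ec  = (esr >> 26) & 0x3f;   /* exception class               */
        unsigned iss = esr & 0x1ffffff;      /* instruction-specific syndrome */
        printf("esr=0x%08lx ec=0x%02x iss=0x%x\n", esr, ec, iss);
    }

    int main(void)
    {
        decode_esr(0x44000000); /* EC 0x11: SVC taken from AArch32 (expected)    */
        decode_esr(0x92000021); /* EC 0x24: data abort from a lower EL, DFSC 0x21
                                 * = alignment fault (the -smp 2 TCG result)     */
        return 0;
    }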
This example is:
- an AArch64 EL1 kernel
- with an AArch32 EL0
- that executes an LDRD from Device memory at a 32-bit-aligned address, then issues an SVC:
# x4 = 0x80080800
ldrd r5, r2, [r4, #4]
svc 0
a: b a /* hang */
In the multi-cpu TCG case only, the LDRD produces an alignment fault.
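For reference, the effective address of that LDRD is 0x80080800 + 4 = 0x80080804, which satisfies 4-byte alignment but not 8-byte alignment; a trivial standalone check (not part of the PoC):

    #include <stdio.h>

    int main(void)
    {
        unsigned int base = 0x80080800;   /* base register value in the PoC      */
        unsigned int addr = base + 4;     /* effective address of the LDRD/STRD  */

        /* 4-byte aligned (so a 4-byte alignment requirement is satisfied), but
         * not 8-byte aligned, so the 8-byte access cannot be performed as one
         * naturally aligned unit at this address. */
        printf("addr=0x%08x 4-byte aligned: %d 8-byte aligned: %d\n",
               addr, (addr & 3) == 0, (addr & 7) == 0);
        return 0;
    }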
Under TCG, in do_ldrd_load, the alignment and atomicity requirements of the LDRD seem to come from here:
MemOp opc = MO_64 | MO_ALIGN_4 | MO_ATOM_SUBALIGN | s->be_data;
However, because of the following in tcg_canonicalize_memop, the atomicity requirement is dropped when running on a single CPU (CF_PARALLEL is set if maxcpus > 1), which explains why this only happens in the -smp 2 case:
/* In serial mode, reduce atomicity. */
if (!(tcg_ctx->gen_tb->cflags & CF_PARALLEL)) {
    op &= ~MO_ATOM_MASK;
    op |= MO_ATOM_NONE;
}
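To make the effect concrete, here is a small standalone sketch of that reduction applied to the MemOp built by do_ldrd_load. The enum values are illustrative stand-ins, not QEMU's real encodings (those live in include/exec/memop.h); only the masking logic mirrors the snippet above:

    #include <stdio.h>

    /* Illustrative stand-in values for the MemOp bits involved -- these are
     * NOT QEMU's real encodings, only placeholders for the masking logic. */
    enum {
        MO_64            = 0x3,
        MO_ALIGN_4       = 0x2 << 4,
        MO_ATOM_SUBALIGN = 0x4 << 8,  /* atomic in parts, per the alignment */
        MO_ATOM_NONE     = 0x5 << 8,  /* no atomicity requirement           */
        MO_ATOM_MASK     = 0x7 << 8,
    };

    /* Mirror of the reduction quoted above from tcg_canonicalize_memop. */
    static int canonicalize(int op, int cf_parallel)
    {
        if (!cf_parallel) {           /* serial mode (-smp 1) */
            op &= ~MO_ATOM_MASK;
            op |= MO_ATOM_NONE;
        }
        return op;
    }

    int main(void)
    {
        int op = MO_64 | MO_ALIGN_4 | MO_ATOM_SUBALIGN;  /* as in do_ldrd_load */

        /* -smp 1: SUBALIGN is replaced by NONE, so no extra atomicity is
         * enforced and the 4-byte-aligned access goes through. */
        printf("serial:   atom bits = 0x%x\n", canonicalize(op, 0) & MO_ATOM_MASK);

        /* -smp 2: SUBALIGN survives canonicalization; enforcing it is where
         * the alignment fault shows up in this report. */
        printf("parallel: atom bits = 0x%x\n", canonicalize(op, 1) & MO_ATOM_MASK);
        return 0;
    }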
I'm not familiar enough with TCG internals to know what the correct fix would be, but it seems that the atomicity requirements of this instruction are wrong when compared to real hardware.
Additional information
This issue also seems to affect the STRD instruction.