ARM user-mode: TBs of self-modifying code not invalidated after __ARM_NR_CACHEFLUSH syscall
Host environment
- Operating system: Arch Linux (CachyOS)
- OS/kernel version: Linux 6.17.1-2-cachyos
- Architecture: x86_64
- QEMU flavor: qemu-arm
- QEMU version: 10.1.0
Emulated/Virtualized environment
- Operating system: Linux (user mode emulation)
- OS/kernel version: N/A (user mode emulation)
- Architecture: ARM (arm-linux-gnueabihf)
Description of problem
On ARM user-mode emulation, when anonymous shared memory is created with memfd_create and mapped twice (one R/X view and one R/W view) for just-in-time code generation and execution, TCG does not invalidate translated code executed through the R/X pointer when that code is modified through the R/W pointer.
This happens despite using the appropriate syscall for invalidating the icache on ARM Linux targets, i.e. __ARM_NR_CACHEFLUSH (0x0f0002). Currently QEMU does not do anything when receiving this syscall.
Note that this issue does not occur when using a single RWX region to emit and later modify code. It only happens with dual-mapped shared memory.
Steps to reproduce
Here is a cut-down program that showcases the issue (Rust, but can easily be translated to C):
use std::{mem::transmute_copy, ptr::null_mut};
use libc::*;
// extern "C" fn ADD(usize, usize) -> usize
// add r0, r1, r0; bx lr
const ADD: &[u8] = b"\x00\x00\x81\xe0\x1e\xff\x2f\xe1";
// extern "C" fn SUB(usize, usize) -> usize
// sub r0, r1, r0; bx lr
const SUB: &[u8] = b"\x00\x00\x41\xe0\x1e\xff\x2f\xe1";
fn main() {
    unsafe {
        // allocate anonymous shared memory using memfd_create and create rx and rw mappings
        let fd = libc::memfd_create(c"vmem".as_ptr(), MFD_CLOEXEC);
        ftruncate(fd, 0x1000);
        let rx = mmap(null_mut(), 0x1000, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0) as *const u8;
        let rw = mmap(null_mut(), 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) as *mut u8;
        close(fd);
        std::ptr::copy_nonoverlapping(ADD.as_ptr(), rw, 8);
        // flush icache for the page using the __ARM_NR_CACHEFLUSH syscall
        syscall(0xf0002, rx, rx.byte_add(0x1000), 0);
        let add: extern "C" fn(usize, usize) -> usize = transmute_copy(&rx);
        add(0x12345, 0x54321);
        std::ptr::copy_nonoverlapping(SUB.as_ptr(), rw, 8);
        syscall(0xf0002, rx, rx.byte_add(0x1000), 0);
        let sub: extern "C" fn(usize, usize) -> usize = transmute_copy(&rx);
        sub(0xABCDE, 0xEDCBA);
    }
}
Compile with --target=arm-unknown-linux-gnueabihf and run it through
qemu-arm -d in_asm,cpu test.elf 2>ins.log
to log instructions and CPU registers. Search for the program constants 12345 and abcde in the log to find the two JIT'd function calls.
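For convenience, the log search can be scripted. The following is a minimal host-side sketch, not part of the reproducer: the helper name find_constants is hypothetical, and the log is assumed to be the ins.log file written by the command above. It prints the line numbers of every log line that contains either constant, matching case-insensitively.

```rust
use std::{env, fs};

/// Return the 1-based numbers of the lines in `log` that contain any of
/// `needles`, compared case-insensitively.
fn find_constants(log: &str, needles: &[&str]) -> Vec<usize> {
    let needles: Vec<String> = needles.iter().map(|n| n.to_lowercase()).collect();
    log.lines()
        .enumerate()
        .filter(|(_, line)| {
            let line = line.to_lowercase();
            needles.iter().any(|n| line.contains(n.as_str()))
        })
        .map(|(i, _)| i + 1)
        .collect()
}

fn main() {
    // Assumption: the log path defaults to ins.log, as in the command above.
    let path = env::args().nth(1).unwrap_or_else(|| "ins.log".to_string());
    match fs::read_to_string(&path) {
        Ok(log) => {
            for n in find_constants(&log, &["12345", "abcde"]) {
                println!("{path}:{n}");
            }
        }
        Err(e) => eprintln!("cannot read {path}: {e}"),
    }
}
```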
The first call, add(0x12345, 0x54321), is translated correctly:
in_asm,cpu trace for add(0x12345, 0x54321)
----------------
IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x00408924: e59f0044 ldr r0, [pc, #0x44]
0x00408928: e59f1044 ldr r1, [pc, #0x44]
0x0040892c: e12fff34 blx r4 <-- Call to add(0x12345, 0x54321)
R00=00000000 R01=40830000 R02=00000000 R03=40831000
R04=40830000 R05=40831000 R06=40830000 R07=40831000
R08=000f0002 R09=00000000 R10=004525f8 R11=407ff848
R12=407ff6f8 R13=407ff6f8 R14=00408924 R15=00408924
PSR=00000010 ---- A usr32
----------------
IN:
0x40830000: e0810000 add r0, r1, r0 <-- Code of ADD, as expected
0x40830004: e12fff1e bx lr
R00=00012345 R01=00054321 R02=00000000 R03=40831000
R04=40830000 R05=40831000 R06=40830000 R07=40831000
R08=000f0002 R09=00000000 R10=004525f8 R11=407ff848
R12=407ff6f8 R13=407ff6f8 R14=00408930 R15=40830000
PSR=00000010 ---- A usr32
----------------
However, the second call uses the outdated, cached instructions:
in_asm,cpu trace for sub(0xABCDE, 0xEDCBA)
----------------
IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x004089bc: e59f0018 ldr r0, [pc, #0x18]
0x004089c0: e59f1018 ldr r1, [pc, #0x18]
0x004089c4: e12fff36 blx r6 <-- Call to sub(0xABCDE, 0xEDCBA)
R00=00000000 R01=40830000 R02=00000000 R03=40831000
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089bc R15=004089bc
PSR=00000010 ---- A usr32
R00=000abcde R01=000edcba R02=00000000 R03=40831000
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089c8 R15=40830000
PSR=00000010 ---- A usr32
----------------
* New instructions are not translated; this is the epilogue of main *
IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x004089c8: e28dd044 add sp, sp, #0x44
0x004089cc: e8bd8ff0 pop {r4, r5, r6, r7, r8, sb, sl, fp, pc}
R00=00199998 R01=000edcba R02=00000000 R03=40831000 <-- Result in R0 is 0xABCDE + 0xEDCBA
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089c8 R15=004089c8
PSR=00000010 ---- A usr32
----------------
Indeed, the register dump at the epilogue entry shows R00=00199998, which is the result of add(0xABCDE, 0xEDCBA) and not of sub(0xABCDE, 0xEDCBA).
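To make the expected values explicit, here is a small host-side sketch that decodes the two byte strings into the instruction words seen in the trace and models both instructions on 32-bit registers. The names arm_words, model_add and model_sub are hypothetical helpers for illustration only. Assuming the usual semantics of `sub r0, r1, r0` (r0 = r1 - r0), a run in which the new code was picked up would be expected to leave 0x41FDC in R0 rather than 0x199998.

```rust
/// Decode little-endian 32-bit ARM instruction words from a byte string.
fn arm_words(code: &[u8]) -> Vec<u32> {
    code.chunks_exact(4)
        .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
        .collect()
}

/// Model of `add r0, r1, r0`: r0 = r1 + r0, wrapping at 32 bits.
fn model_add(r0: u32, r1: u32) -> u32 {
    r1.wrapping_add(r0)
}

/// Model of `sub r0, r1, r0`: r0 = r1 - r0, wrapping at 32 bits.
fn model_sub(r0: u32, r1: u32) -> u32 {
    r1.wrapping_sub(r0)
}

fn main() {
    // Same byte strings as in the reproducer.
    const ADD: &[u8] = b"\x00\x00\x81\xe0\x1e\xff\x2f\xe1";
    const SUB: &[u8] = b"\x00\x00\x41\xe0\x1e\xff\x2f\xe1";

    for (name, code) in [("ADD", ADD), ("SUB", SUB)] {
        for word in arm_words(code) {
            println!("{name}: {word:08x}");
        }
    }

    // The first argument arrives in r0, the second in r1.
    println!("stale add result:    {:08x}", model_add(0xABCDE, 0xEDCBA));
    println!("expected sub result: {:08x}", model_sub(0xABCDE, 0xEDCBA));
}
```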