ARM user-mode: TBs of self-modifying code not invalidated after __ARM_NR_CACHEFLUSH syscall

Host environment

  • Operating system: Arch Linux (CachyOS)
  • OS/kernel version: Linux 6.17.1-2-cachyos
  • Architecture: x86_64
  • QEMU flavor: qemu-arm
  • QEMU version: 10.1.0

Emulated/Virtualized environment

  • Operating system: Linux (user mode emulation)
  • OS/kernel version: N/A (user mode emulation)
  • Architecture: ARM (arm-linux-gnueabihf)

Description of problem

On ARM user-mode emulation, when anonymous shared memory is created with memfd_create and mapped twice, once as R/X and once as R/W, for just-in-time code generation and execution, TCG does not invalidate translated code that was executed through the R/X mapping when that code is modified through the R/W mapping.

This happens even though the program uses the appropriate syscall for invalidating the icache on ARM Linux targets, i.e. __ARM_NR_CACHEFLUSH (0x0f0002). QEMU currently does nothing when it receives this syscall.

Note that this issue does not occur when using a single RWX region to emit and later modify code. It only happens with dual-mapped shared memory.

Steps to reproduce

Here is a cut-down program that showcases the issue (Rust, but can easily be translated to C):

use std::{mem::transmute_copy, ptr::null_mut};

use libc::*;

// extern "C" fn ADD(usize, usize) -> usize
// add r0, r1, r0; bx lr
const ADD: &[u8] = b"\x00\x00\x81\xe0\x1e\xff\x2f\xe1";

// extern "C" fn SUB(usize, usize) -> usize
// sub r0, r1, r0; bx lr
const SUB: &[u8] = b"\x00\x00\x41\xe0\x1e\xff\x2f\xe1";

fn main() {
    unsafe {
        // allocate anonymous shared memory using memfd_create and create rx and rw mappings
        let fd = libc::memfd_create(c"vmem".as_ptr(), MFD_CLOEXEC);
        ftruncate(fd, 0x1000);
        let rx = mmap(null_mut(), 0x1000, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0) as *const u8;
        let rw = mmap(null_mut(), 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) as *mut u8;
        close(fd);

        std::ptr::copy_nonoverlapping(ADD.as_ptr(), rw, 8);

        // flush icache for the page using the __ARM_NR_CACHEFLUSH syscall
        syscall(0xf0002, rx, rx.byte_add(0x1000), 0);

        let add: extern "C" fn(usize, usize) -> usize = transmute_copy(&rx);
        add(0x12345, 0x54321);

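        // overwrite the same bytes with SUB through the rw mapping,
        // then flush the icache for the rx range again
        std::ptr::copy_nonoverlapping(SUB.as_ptr(), rw, 8);
        syscall(0xf0002, rx, rx.byte_add(0x1000), 0);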

        let sub: extern "C" fn(usize, usize) -> usize = transmute_copy(&rx);
        sub(0xABCDE, 0xEDCBA);
    }
}

Compile with --target=arm-unknown-linux-gnueabihf and run it through

qemu-arm -d in_asm,cpu test.elf 2>ins.log

to log instructions and CPU registers. Search for the program constants 12345 and abcde in the log to find the two JIT'd function calls.

The first call, add(0x12345, 0x54321), is translated correctly:

in_asm,cpu trace for add(0x12345, 0x54321)
----------------
IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x00408924:  e59f0044  ldr      r0, [pc, #0x44]
0x00408928:  e59f1044  ldr      r1, [pc, #0x44]
0x0040892c:  e12fff34  blx      r4                  <-- Call to add(0x12345, 0x54321)

R00=00000000 R01=40830000 R02=00000000 R03=40831000
R04=40830000 R05=40831000 R06=40830000 R07=40831000
R08=000f0002 R09=00000000 R10=004525f8 R11=407ff848
R12=407ff6f8 R13=407ff6f8 R14=00408924 R15=00408924
PSR=00000010 ---- A usr32
----------------
IN: 
0x40830000:  e0810000  add      r0, r1, r0         <-- Code of ADD, as expected
0x40830004:  e12fff1e  bx       lr

R00=00012345 R01=00054321 R02=00000000 R03=40831000
R04=40830000 R05=40831000 R06=40830000 R07=40831000
R08=000f0002 R09=00000000 R10=004525f8 R11=407ff848
R12=407ff6f8 R13=407ff6f8 R14=00408930 R15=40830000
PSR=00000010 ---- A usr32
----------------
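The instruction words in the log can be cross-checked against the byte strings in the reproducer. The following is a minimal sketch, assuming little-endian A32 encoding (which matches the trace above); the helper name a32_words is hypothetical and not part of the original program. It shows that ADD decodes to the words e0810000 and e12fff1e seen at 0x40830000, while SUB would start with e0410000, a word that would have to appear in the log if the second call had been retranslated.

```rust
// Byte strings copied from the reproducer.
const ADD: &[u8] = b"\x00\x00\x81\xe0\x1e\xff\x2f\xe1";
const SUB: &[u8] = b"\x00\x00\x41\xe0\x1e\xff\x2f\xe1";

/// Split a machine-code byte string into little-endian 32-bit A32
/// instruction words (hypothetical helper, for illustration only).
fn a32_words(code: &[u8]) -> Vec<u32> {
    code.chunks_exact(4)
        .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
        .collect()
}

fn main() {
    for (name, code) in [("ADD", ADD), ("SUB", SUB)] {
        // Format each word the way the in_asm log prints it.
        let words: Vec<String> = a32_words(code)
            .iter()
            .map(|w| format!("{w:08x}"))
            .collect();
        println!("{name}: {}", words.join(" "));
    }
    // prints:
    // ADD: e0810000 e12fff1e
    // SUB: e0410000 e12fff1e
}
```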

However, the second call uses the outdated, cached instructions:

in_asm,cpu trace for sub(0xABCDE, 0xEDCBA)
----------------
IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x004089bc:  e59f0018  ldr      r0, [pc, #0x18]
0x004089c0:  e59f1018  ldr      r1, [pc, #0x18]
0x004089c4:  e12fff36  blx      r6                <-- Call to sub(0xABCDE, 0xEDCBA)

R00=00000000 R01=40830000 R02=00000000 R03=40831000
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089bc R15=004089bc
PSR=00000010 ---- A usr32
R00=000abcde R01=000edcba R02=00000000 R03=40831000
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089c8 R15=40830000
PSR=00000010 ---- A usr32
----------------
* New instructions are not translated; this is the epilogue of main * 

IN: _ZN8qemu_bug4main17h238291d36a4a51f6E
0x004089c8:  e28dd044  add      sp, sp, #0x44      
0x004089cc:  e8bd8ff0  pop      {r4, r5, r6, r7, r8, sb, sl, fp, pc}

R00=00199998 R01=000edcba R02=00000000 R03=40831000 <-- Result in R0 is 0xABCDE + 0xEDCBA
R04=40831000 R05=40831004 R06=40830000 R07=00e12fff
R08=0000e081 R09=00e08100 R10=40831000 R11=0000e12f
R12=407ff6f0 R13=407ff6f0 R14=004089c8 R15=004089c8
PSR=00000010 ---- A usr32
----------------

Indeed, the register dump at the epilogue entry shows R00=00199998, which is the result of add(0xABCDE, 0xEDCBA) and not of sub(0xABCDE, 0xEDCBA).
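For reference, the two possible results can be computed on the host. This is a rough sketch that only models the arithmetic of the two instructions, assuming the first argument is in r0 and the second in r1 (as the register dumps show); the function names arm_add and arm_sub are hypothetical. Since `sub r0, r1, r0` computes r1 - r0, a correctly retranslated call would have left 00041fdc in R0.

```rust
/// Model of `add r0, r1, r0`: r0 = r1 + r0, with 32-bit wraparound.
fn arm_add(r0: u32, r1: u32) -> u32 {
    r1.wrapping_add(r0)
}

/// Model of `sub r0, r1, r0`: r0 = r1 - r0, with 32-bit wraparound.
fn arm_sub(r0: u32, r1: u32) -> u32 {
    r1.wrapping_sub(r0)
}

fn main() {
    // Arguments of the second JIT'd call in the reproducer.
    let (a, b) = (0xABCDE_u32, 0xEDCBA_u32);
    println!("stale ADD result:    {:08x}", arm_add(a, b)); // 00199998 (observed)
    println!("expected SUB result: {:08x}", arm_sub(a, b)); // 00041fdc
}
```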

Edited by William Tremblay