qemu8-user on Linux: SIGSEGV because the page at brk(NULL) does not exist

Host environment

Operating system: any recent Linux of the last several years
OS/kernel version: for example: Linux host 6.2.15-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 11 16:51:53 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Architecture: x86_64
QEMU flavor: qemu-user-x86_64
QEMU version: commit fcb237e6 of 2023-07-10
QEMU command line: qemu-x86_64 -strace upx-4.0.2-amd64_linux/upx --version

Emulated/Virtualized environment

Operating system: same as Host
OS/kernel version: same as Host
Architecture: same as Host

Description of problem

On Linux, the return value of the system call brk(NULL) need not point to a page that exists. If it does not, then qemu8-user will generate SIGSEGV at the next call to brk() with a higher value, because qemu8 believes that it should maintain contiguous .bss with bytes of value 0. Thus qemu8-user calls `memset(g2h_untagged(target_brk), 0, brk_page - target_brk);` in do_brk() at ../linux-user/syscall.c:867, and this generates SIGSEGV at the non-existent page that covers brk(NULL).
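The size of that memset() can be checked by hand with the brk(NULL) value from the -strace output in this report. This is only a sketch: it assumes 4 KiB pages and that brk_page is target_brk rounded up to the next page boundary. The helper names are made up for illustration and are not QEMU functions.

```c
#include <stdint.h>

/* Hypothetical helper (not a QEMU function): round addr up to a multiple
 * of page_size, which must be a power of two. */
static uint64_t page_align_up(uint64_t addr, uint64_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical helper: number of bytes from addr to the end of its page.
 * Under the stated assumption this equals brk_page - target_brk, i.e. the
 * length passed to memset(). */
static uint64_t tail_len(uint64_t addr, uint64_t page_size)
{
    return page_align_up(addr, page_size) - addr;
}
```

With target_brk = 0x874f46 and 4096-byte pages, page_align_up() gives 0x875000 and tail_len() gives 0xba (186) bytes, all of which lie in the page that starts at 0x874000. If that page is not mapped, the very first byte written faults at 0x874f46, which matches the reported SIGSEGV address.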

Instead, the safest thing to do is nothing at all. Linux deliberately returns a random value for brk(NULL), subject to the conditions that the value be at least as large as the maximum over all PT_LOAD of (.p_vaddr + .p_memsz), and "somewhat near" that maximum. The purpose of the randomness is to interfere with the effectiveness of malware, and to expose application coding errors regarding brk() and sbrk(). If qemu-user wants to preserve contiguous .bss, then qemu-user should call memset() only if the first page of the range exists. (As explained in the next paragraph, "contiguous .bss" is a murky concept.)
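One way to express "call memset() only if the first page of the range exists" is sketched below as a standalone host-side program fragment. It is not a patch for do_brk(): the function names are hypothetical, and using mincore() as the existence test is an assumption of this sketch (QEMU has its own page-tracking facilities that may be a better fit). Note also that mincore() only reports whether a page is mapped, not whether it is writable.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the page containing addr is mapped,
 * 0 if it is not, -1 on any other mincore() failure. */
static int page_is_mapped(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    /* mincore() requires a page-aligned start address. */
    uintptr_t page = (uintptr_t)addr & ~((uintptr_t)page_size - 1);
    unsigned char vec;

    if (mincore((void *)page, (size_t)page_size, &vec) == 0) {
        return 1;
    }
    /* On Linux, ENOMEM means the range contains unmapped memory. */
    return errno == ENOMEM ? 0 : -1;
}

/* Hypothetical helper: zero [start, end) only when the first page of the
 * range is mapped. Returns 1 if the bytes were zeroed, 0 if skipped.
 * Assumes the range does not extend past the end of that first page,
 * as is the case for the tail of the page holding the old brk. */
static int zero_tail_if_mapped(unsigned char *start, unsigned char *end)
{
    if (start >= end || page_is_mapped(start) != 1) {
        return 0;
    }
    memset(start, 0, (size_t)(end - start));
    return 1;
}
```

The design choice here is to skip the write rather than to create the missing page, in line with "the safest thing to do is nothing at all".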

Linux itself is partly to blame, because it computes the maximum (.p_vaddr + .p_memsz) over all the PT_LOAD of the most recent execve(). The most recent execve() seen by Linux might have no relationship to the state of the address space at the time of either call to brk(). The app can do arbitrary mmap, munmap, mprotect at any time. In particular, the run-time de-compressor of UPX does exactly that for a compressed main program. The maximum computed by Linux is for the compressed program, which has a different layout than the de-compressed program.

There is a Linux system call prctl(PR_SET_MM, PR_SET_MM_BRK, new_value, 0, 0) which sets a value for "the brk", but that syscall tries to validate the new_value based on the most recent execve(). Once again, that has no relationship to the current layout of the address space produced by the UPX de-compressor.
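For reference, a minimal sketch of issuing that request from C follows. Through the glibc prctl() wrapper the option is PR_SET_MM and PR_SET_MM_BRK is passed as the second argument. The wrapper name try_set_mm_brk is hypothetical. The call needs CAP_SYS_RESOURCE, so in an unprivileged process it is expected to fail with EPERM before any validation of the address takes place.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical wrapper: ask the kernel to set "the brk" to new_brk.
 * Returns 0 on success, or a negative errno value on failure
 * (for example -EPERM without CAP_SYS_RESOURCE, or -EINVAL when the
 * kernel rejects the address). */
static int try_set_mm_brk(unsigned long new_brk)
{
    if (prctl(PR_SET_MM, PR_SET_MM_BRK, new_brk, 0UL, 0UL) == 0) {
        return 0;
    }
    return -errno;
}
```

Even when the call is permitted, the kernel's checks are made against the layout it recorded earlier, so a value that is sensible for the de-compressed program may still be rejected.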

Steps to reproduce

build qemu8-x86_64 from

commit fcb237e64f9d026c03d635579c7b288d0008a6e5 (HEAD -> master, origin/master, origin/HEAD)
Merge: 2ff49e96ac c00aac6f14
Date:   Mon Jul 10 09:17:06 2023 +0100

run build/qemu-x86_64 -strace upx-4.0.2-amd64_linux/upx --version where the upx is from https://github.com/upx/upx/releases/download/v4.0.2/upx-4.0.2-amd64_linux.tar.xz
output ends with

372621 close(3) = 0
372621 munmap(0x0000004000803000,3055) = 0   ## last syscall by upx stub de-compressor
372621 arch_prctl(4098,8326728,16,0,131072,0) = 0   ## first syscall by PT_INTERP ld-linux.so.2
372621 set_tid_address(0x7f1054) = 372621
372621 brk(NULL) = 0x0000000000874f46
372621 brk(0x0000000000887000)
(gdb) SIGSEGV at 0x0000000000874f46
