A bug in AArch64 FMOPA/FMOPS (widening) instruction

Host environment

Operating system: Ubuntu 22.04
OS/kernel version: Linux lin005 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Architecture: x86-64
QEMU flavor: qemu-aarch64
QEMU version: qemu-aarch64 version 9.0.50 (v9.0.0-1123-g74abb45dac)

QEMU command line:

./qemu-aarch64 -L /usr/aarch64-linux-gnu/ -cpu max,sme256=on ./test.bin

Emulated/Virtualized environment

Operating system: N/A (qemu-user)
OS/kernel version: N/A (qemu-user)
Architecture: aarch64

Description of problem

fmopa computes the multiplication of two matrices in the source registers and accumulates the result to the destination register. A source register’s element size is 16 bits, while a destination register’s element size is 64 bits in the case of widening variant of this instruction. Before the matrix multiplication is performed, each element should be converted to a 64-bit floating point. FPCR flags are considered when converting floating point values. Especially, when the FZ (or FZ16) flag is set, denormalized values are converted into zero. When the floating point size is 16 bits, FZ16 should be considered; otherwise, FZ flag should be used.

However, the current implementation only considers FZ flag, not FZ16 flag, so it computes the wrong value.

Steps to reproduce

Write test.c.

#include <stdio.h>

char i_P2[4] = { 0xff, 0xff, 0xff, 0xff };
char i_P5[4] = { 0xff, 0xff, 0xff, 0xff };
char i_Z0[32] = { // Set only the first element as non-zero
    0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
};
char i_Z16[32] = { // Set only the first element as non-zero
    0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
};
char i_ZA3H[128] = { 0x0, };
uint64_t i_fpcr = 0x0001000000; // FZ = 1;
char o_ZA3H[128];

void __attribute__ ((noinline)) show_state() {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 16; j++) {
            printf("%02x ", o_ZA3H[16*i+j]);
        }
        printf("\n");
    }
}

void __attribute__ ((noinline)) run() {
    __asm__ (
        ".arch armv9.3-a+sme\n"
        "smstart\n"
        "adrp x29, i_P2\n"
        "add x29, x29, :lo12:i_P2\n"
        "ldr p2, [x29]\n"
        "adrp x29, i_P5\n"
        "add x29, x29, :lo12:i_P5\n"
        "ldr p5, [x29]\n"
        "adrp x29, i_Z0\n"
        "add x29, x29, :lo12:i_Z0\n"
        "ldr z0, [x29]\n"
        "adrp x29, i_Z16\n"
        "add x29, x29, :lo12:i_Z16\n"
        "ldr z16, [x29]\n"
        "adrp x29, i_ZA3H\n"
        "add x29, x29, :lo12:i_ZA3H\n"
        "mov x15, 0\n"
        "ld1w {za3h.s[w15, 0]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "ld1w {za3h.s[w15, 1]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "mov x15, 2\n"
        "ld1w {za3h.s[w15, 0]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "ld1w {za3h.s[w15, 1]}, p2, [x29]\n"
        "adrp x29, i_fpcr\n"
        "add x29, x29, :lo12:i_fpcr\n"
        "ldr x29, [x29]\n"
        "msr fpcr, x29\n"
        ".inst 0x81a0aa03\n" // fmopa   za3.s, p2/m, p5/m, z16.h, z0.h
        "adrp x29, o_ZA3H\n"
        "add x29, x29, :lo12:o_ZA3H\n"
        "mov x15, 0\n"
        "st1w {za3h.s[w15, 0]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "st1w {za3h.s[w15, 1]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "mov x15, 2\n"
        "st1w {za3h.s[w15, 0]}, p2, [x29]\n"
        "add x29, x29, 32\n"
        "st1w {za3h.s[w15, 1]}, p2, [x29]\n"
        ".arch armv8-a\n"
    );
}

int main(int argc, char **argv) {
    run();
    show_state();
    return 0;
}

Compile test.bin using this command: aarch64-linux-gnu-gcc-12 -O2 -no-pie ./test.c -o ./test.bin.
Run QEMU using this command: qemu-aarch64 -L /usr/aarch64-linux-gnu/ -cpu max,sme256=on ./test.bin.
The program, runs on top of the buggy QEMU, prints only zero bytes. It should print 00 01 7e 2f + 00 .. (rest of bytes) .. 00 after the bug is fixed.

A bug in AArch64 FMOPA/FMOPS (widening) instruction

Host environment

Emulated/Virtualized environment

Description of problem

Steps to reproduce

Additional information