ARM64 SME ST1Q incorrectly stores 9 bytes (rather than 16) per 128-bit element
Host environment
- Operating system: Ubuntu 20.04.5 LTS
- OS/kernel version: Linux 5.15.0-1040-aws #45~20.04.1-Ubuntu SMP Tue Jul 11 19:11:12 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
- Architecture: ARM64
- QEMU flavor: qemu-aarch64
- QEMU version: qemu-aarch64 version 8.0.93 (also confirmed on master branch)
- QEMU command line: qemu-aarch64 -cpu max,sme=on ./qemu_repro
Emulated/Virtualized environment
N/A
Description of problem
QEMU incorrectly stores 9 bytes instead of 16 per 128-bit element in the ST1Q SME instruction (https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/ST1Q--Contiguous-store-of-quadwords-from-128-bit-element-ZA-tile-slice-). It copies the first byte of the upper 64-bits, then lower the 64-bits.
This seems to be a simple issue; I tracked it down to: https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sme_helper.c?ref_type=heads#L382
Updating that + 1
to a + 8
fixes the problem.
Steps to reproduce
#include <stdio.h>
#include <stdint.h>
#include <string.h>
void st1q_sme_copy_test(uint8_t* src, uint8_t* dest) {
asm volatile(
"smstart sm\n"
"smstart za\n"
"ptrue p0.b\n"
"mov x12, xzr\n"
"ld1q {za0h.q[w12, 0]}, p0/z, %0\n"
"st1q {za0h.q[w12, 0]}, p0, %1\n"
"smstop za\n"
"smstop sm\n" : : "m"(*src), "m"(*dest) : "w12", "p0");
}
void print_first_128(uint8_t* data) {
putchar('[');
for (int i = 0; i < 16; i++) {
printf("%02d", data[i]);
if (i != 15)
printf(", ");
}
printf("]\n");
}
int main() {
_Alignas(16) uint8_t dest[512] = { };
_Alignas(16) uint8_t src[512] = { };
for (int i = 0; i < sizeof(src); i++)
src[i] = i;
puts("Before");
printf(" src: ");
print_first_128(src);
printf("dest: ");
print_first_128(dest);
st1q_sme_copy_test(src, dest);
puts("\nAfter ");
printf(" src: ");
print_first_128(src);
printf("dest: ");
print_first_128(dest);
}
Compile with (requires at least clang ~14, tested with clang 16):
clang ./qemu_repro.c -march=armv9-a+sme+sve -o ./qemu_repro
Run with:
qemu-aarch64 -cpu max,sme=on ./qemu_repro
It's expected just to copy from src
to dest
and output:
Before
src: [00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15]
dest: [00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00]
After
src: [00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15]
dest: [00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15]
But currently outputs:
Before
src: [00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15]
dest: [00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00]
After
src: [00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15]
dest: [00, 08, 09, 10, 11, 12, 13, 14, 15, 00, 00, 00, 00, 00, 00, 00]
Additional information
N/A