Incorrect behavior with 64-bit element SDOT and UDOT instructions on ARM SVE when sve-default-vector-length>=64
Host environment
-
Operating system:
Ubuntu 22.04.5 LTS on Windows 10 x86_64
-
OS/kernel version:
5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-
Architecture:
x86_64
-
QEMU flavor:
qemu-aarch64
-
QEMU version:
version 9.1.50 (v9.1.0-475-ga53b931645)
-
QEMU command line: N/A, see steps to reproduce
Emulated/Virtualized environment
-
Operating system:
N/A
-
OS/kernel version:
N/A
-
Architecture:
Arm
Description of problem
The behavior of SDOT and UDOT instructions are incorrect when the Zresult.D register is used, which is the 64-bit svdot_lane_{s,u}64 intrinsic in ACLE.
I have tested the same code using Arm Instruction Emulator (which is deprecated though) and gem5 which produced correct result, I believe that the SDOT and UDOT implementation in qemu is incorrect.
Steps to reproduce
-
Get Arm Gnu toolchain from Arm GNU Toolchain Downloads – Arm Developer, for x86 Linux hosts, download arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu.tar.xz and extract it. Alternatively, use any compiler that is able to cross compile for armv8.2-a+sve targets.
-
Compile the following program with these compiler arguments
arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc -O3 -march=armv8.2-a+sve dot_lane.c -o dot_lane
#include <stdio.h> #include <arm_sve.h> int64_t a[32] = { 0 }; int16_t b[128]; int16_t c[128]; int64_t r[32]; int64_t expected_r[32]; #define IMM 0 int main(void) { for (size_t i = 0; i < 128; i++) { b[i] = 1; c[i] = i / 4; } svint64_t av = svld1(svptrue_b64(), a); svint16_t bv = svld1(svptrue_b16(), b); svint16_t cv = svld1(svptrue_b16(), c); svint64_t result = svdot_lane_s64(av, bv, cv, IMM); svst1(svptrue_b64(), r, result); for (size_t i = 0; i < svcntd(); i++) { expected_r[i] = (int64_t)b[i * 4 + 0] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 0] + (int64_t)b[i * 4 + 1] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 1] + (int64_t)b[i * 4 + 2] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 2] + (int64_t)b[i * 4 + 3] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 3] + a[i]; } printf("%12s", "r: "); for (size_t i = 0; i < svcntd(); i++) { printf("%4ld", r[i]); } printf("\n"); printf("%12s", "expected_r: "); for (size_t i = 0; i < svcntd(); i++) { printf("%4ld", expected_r[i]); } printf("\n\t\t"); for (size_t i = 0; i < svcntd(); i++) { if (r[i] != expected_r[i]) { printf("%4c", '^'); } else { printf("%4c", ' '); } } printf("\n"); printf("idx:\t\t"); for (size_t i = 0; i < svcntd(); i++) { if (r[i] != expected_r[i]) { printf("%4d", i); } else { printf("%4c", ' '); } } printf("\n"); return 0; }
-
Execute it with the following commands:
qemu-aarch64 -cpu max,sve-default-vector-length=16 -L arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/../aarch64-none-linux-gnu/libc dot_lane
Change the value of
sve-default-vector-length
to 32, 64, 128, 256 and observe the outputs, we should see that forsve-default-vector-length
>= 64, the result is incorrect.sve-default-vector-length=16
r: 0 0 expected_r: 0 0 idx:
sve-default-vector-length=32
r: 0 0 8 8 expected_r: 0 0 8 8 idx:
sve-default-vector-length=64
r: 0 0 8 8 8 8 24 24 expected_r: 0 0 8 8 16 16 24 24 ^ ^ idx: 4 5
sve-default-vector-length=128
r: 0 0 8 8 8 8 24 24 24 24 40 40 40 40 56 56 expected_r: 0 0 8 8 16 16 24 24 32 32 40 40 48 48 56 56 ^ ^ ^ ^ ^ ^ idx: 4 5 8 9 12 13
sve-default-vector-length=256
r: 0 0 8 8 8 8 24 24 24 24 40 40 40 40 56 56 56 56 72 72 72 72 88 88 88 88 104 104 104 104 120 120 expected_r: 0 0 8 8 16 16 24 24 32 32 40 40 48 48 56 56 64 64 72 72 80 80 88 88 96 96 104 104 112 112 120 120 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ idx: 4 5 8 9 12 13 16 17 20 21 24 25 28 29
-
By passing
-S
to the compiler, we can see that sdot (or udot if usingsvdot_lane_u64()
) is produced in assembly (sdot z0.d, z1.h, z2.h[0]
), which is correct behavior according to Intrinsics – Arm Developer.