Skip to content

Incorrect behavior with 64-bit element SDOT and UDOT instructions on ARM SVE when sve-default-vector-length>=64

Host environment

  • Operating system:

    Ubuntu 22.04.5 LTS on Windows 10 x86_64

  • OS/kernel version:

    5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • Architecture:

    x86_64

  • QEMU flavor:

    qemu-aarch64

  • QEMU version:

    version 9.1.50 (v9.1.0-475-ga53b931645)

  • QEMU command line: N/A, see steps to reproduce

Emulated/Virtualized environment

  • Operating system:

    N/A

  • OS/kernel version:

    N/A

  • Architecture:

    Arm

Description of problem

The behavior of SDOT and UDOT instructions are incorrect when the Zresult.D register is used, which is the 64-bit svdot_lane_{s,u}64 intrinsic in ACLE.

I have tested the same code using Arm Instruction Emulator (which is deprecated though) and gem5 which produced correct result, I believe that the SDOT and UDOT implementation in qemu is incorrect.

Steps to reproduce

  1. Get Arm Gnu toolchain from Arm GNU Toolchain Downloads – Arm Developer, for x86 Linux hosts, download arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu.tar.xz and extract it. Alternatively, use any compiler that is able to cross compile for armv8.2-a+sve targets.

  2. Compile the following program with these compiler arguments

    arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc -O3 -march=armv8.2-a+sve dot_lane.c -o dot_lane
    #include <stdio.h>
    #include <arm_sve.h>
    
    int64_t a[32] = { 0 };
    int16_t b[128];
    int16_t c[128];
    int64_t r[32];
    int64_t expected_r[32];
    
    #define IMM 0
    
    int main(void)
    {
        for (size_t i = 0; i < 128; i++) {
            b[i] = 1;
            c[i] = i / 4;
        }
    
        svint64_t av = svld1(svptrue_b64(), a);
        svint16_t bv = svld1(svptrue_b16(), b);
        svint16_t cv = svld1(svptrue_b16(), c);
    
        svint64_t result = svdot_lane_s64(av, bv, cv, IMM);
    
        svst1(svptrue_b64(), r, result);
    
        for (size_t i = 0; i < svcntd(); i++) {
            expected_r[i] = 
                (int64_t)b[i * 4 + 0] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 0] +
                (int64_t)b[i * 4 + 1] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 1] +
                (int64_t)b[i * 4 + 2] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 2] +
                (int64_t)b[i * 4 + 3] * (int64_t)c[(i - i % 2) * 4 + IMM * 4 + 3] +
                a[i];
        }
        
        printf("%12s", "r: ");
        for (size_t i = 0; i < svcntd(); i++) {
            printf("%4ld", r[i]);
        }
        printf("\n");
        printf("%12s", "expected_r: ");
        for (size_t i = 0; i < svcntd(); i++) {
            printf("%4ld", expected_r[i]);
        }
        printf("\n\t\t");
        for (size_t i = 0; i < svcntd(); i++) {
            if (r[i] != expected_r[i]) {
                printf("%4c", '^');
            } else {
                printf("%4c", ' ');
            }
        }
        printf("\n");
        printf("idx:\t\t");
        for (size_t i = 0; i < svcntd(); i++) {
            if (r[i] != expected_r[i]) {
                printf("%4d", i);
            } else {
                printf("%4c", ' ');
            }
        }
        printf("\n");
    
        return 0;
    }
  3. Execute it with the following commands:

    qemu-aarch64 -cpu max,sve-default-vector-length=16 -L arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/../aarch64-none-linux-gnu/libc dot_lane

    Change the value of sve-default-vector-length to 32, 64, 128, 256 and observe the outputs, we should see that for sve-default-vector-length >= 64, the result is incorrect.

    sve-default-vector-length=16

             r:    0   0
    expected_r:    0   0
                            
    idx:                

    sve-default-vector-length=32

             r:    0   0   8   8
    expected_r:    0   0   8   8
                                    
    idx:                        

    sve-default-vector-length=64

             r:    0   0   8   8   8   8  24  24
    expected_r:    0   0   8   8  16  16  24  24
                                       ^   ^        
    idx:                               4   5         

    sve-default-vector-length=128

             r:    0   0   8   8   8   8  24  24  24  24  40  40  40  40  56  56
    expected_r:    0   0   8   8  16  16  24  24  32  32  40  40  48  48  56  56
                                       ^   ^           ^   ^           ^   ^        
    idx:                               4   5           8   9          12  13       

    sve-default-vector-length=256

             r:    0   0   8   8   8   8  24  24  24  24  40  40  40  40  56  56  56  56  72  72  72  72  88  88  88  88 104 104 104 104 120 120
    expected_r:    0   0   8   8  16  16  24  24  32  32  40  40  48  48  56  56  64  64  72  72  80  80  88  88  96  96 104 104 112 112 120 120
                                       ^   ^           ^   ^           ^   ^           ^   ^           ^   ^           ^   ^           ^   ^        
    idx:                               4   5           8   9          12  13          16  17          20  21          24  25          28  29     
  4. By passing -S to the compiler, we can see that sdot (or udot if using svdot_lane_u64()) is produced in assembly (sdot z0.d, z1.h, z2.h[0]), which is correct behavior according to Intrinsics – Arm Developer.

Additional information

Edited by MeteorEmber
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information