Some x86_64 SSE operations have incorrect/erratic behaviours

Setup

The issue has been tested on the following QEMU tags:

  • tags/v5.2.0
  • tags/v6.0.0

Machine:

Linux ubuntu 5.8.0-55-generic 20.04.1-Ubuntu SMP x86_64

Command line:

 qemu-x86_64 -cpu max ./test_sse

(the source code, sse_test.c, is attached)

Issue details

Some x86 SSE operations implemented in `target/i386/ops_sse.h` do not behave correctly.

The issue can be reproduced using the attached test code sse_test.c, which demonstrates it with the pshufb instruction. When run natively (outside QEMU), the test program prints the following output:

user@ubuntu:~$ ./test_sse 
          Test for SSE operations issue in Qemu v6.0.0.
          0x7878787878787878 0x7878787878787878 0x0 0x0 
          0x7878787878787878 0x7878787878787878 

Under QEMU, the same program prints:

     user@ubuntu:~$ qemu/build/x86_64-linux-user/qemu-x86_64 -cpu max test_sse 
             Test for SSE operations issue in Qemu v6.0.0.
             0x7878787878787878 0x7878787878787878 0x30 0x40000052a0 
             0x7878787878787878 0x7878787878787878 

Note: the observed value may not be the same on your machine (it comes from an uninitialised variable).

Root cause

Some SSE helpers defined in ops_sse.h use an uninitialised stack variable as a local ZMM register. At the end of these helpers, the whole local is copied into the destination register, regardless of the operand size. If only part of the local ZMM register has been written, uninitialised stack memory is copied into the destination register. Example: the pshufb helper

    /* SSSE3 op helpers */
    void glue(helper_pshufb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
    {
        int i;
        Reg r;      // <<<--- uninitialised stack var (Reg resolves to ZMMReg)

        for (i = 0; i < (8 << SHIFT); i++) {  // <<<--- This loop may not access all fields of the Reg structure
            r.B(i) = (s->B(i) & 0x80) ? 0 : (d->B(s->B(i) & ((8 << SHIFT) - 1)));
        }

        *d = r;    // <<<--- Copy the partially initialised stack variable r to d
    }
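
The effect described above can be modelled outside QEMU with a small, self-contained sketch. Everything here is an assumption made for illustration: `ModelReg` is a hypothetical 64-byte stand-in for ZMMReg, `model_pshufb_buggy` mimics the helper for the 128-bit case only, and the 0xAA fill stands in for stack garbage (reading truly uninitialised memory in a demo would be undefined behaviour).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's ZMMReg: 64 bytes, of which a
 * 128-bit SSE operation only defines the low 16. */
typedef struct { uint8_t b[64]; } ModelReg;

#define XMM_BYTES 16

/* Models the buggy helper for the 128-bit case. */
void model_pshufb_buggy(ModelReg *d, const ModelReg *s)
{
    ModelReg r;
    int i;

    assert(d != NULL && s != NULL);

    /* Stand-in for whatever an uninitialised stack slot would hold. */
    memset(&r, 0xAA, sizeof(r));

    /* Only the low 16 bytes of r are written by the operation. */
    for (i = 0; i < XMM_BYTES; i++) {
        r.b[i] = (s->b[i] & 0x80) ? 0 : d->b[s->b[i] & (XMM_BYTES - 1)];
    }

    *d = r;  /* copies all 64 bytes, including the 48 never written */
}

/* Returns how many bytes above the low 16 were changed by the helper. */
int count_clobbered_upper_bytes(void)
{
    ModelReg d, s, before;
    int i, n = 0;

    memset(&d, 0, sizeof(d));        /* upper part of the register is zero */
    memset(d.b, 0x78, XMM_BYTES);    /* low 128 bits hold 0x78 bytes */
    memset(&s, 0, sizeof(s));        /* shuffle mask: every lane selects byte 0 */
    before = d;

    model_pshufb_buggy(&d, &s);

    for (i = XMM_BYTES; i < 64; i++) {
        n += (d.b[i] != before.b[i]);
    }
    return n;
}
```

In this model, `count_clobbered_upper_bytes()` returns 48: the low 16 bytes hold the expected shuffle result, while every byte above them has been overwritten with the fill pattern.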

Fix

I'm not sure how to fix this with a minimal impact on performance.

I guess that defining a dedicated helper for each register size can be a good way to fix this.

However, this may require modifications in i386/translate.c, which I currently do not fully understand.
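
As a possible lower-impact alternative, the helper could write back only the bytes the operation actually defines. The following is a minimal sketch of that idea, not a patch against QEMU: `ModelReg`, `model_pshufb_fixed` and the explicit `width` parameter are hypothetical names introduced for illustration, and the real helpers derive the width from SHIFT rather than taking it as an argument.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's ZMMReg. */
typedef struct { uint8_t b[64]; } ModelReg;

/* width: number of bytes the operation defines (8 for MMX, 16 for SSE).
 * It must be a power of two so that (width - 1) works as an index mask. */
void model_pshufb_fixed(ModelReg *d, const ModelReg *s, size_t width)
{
    uint8_t r[64];
    size_t i;

    assert(width > 0 && width <= sizeof(r) && (width & (width - 1)) == 0);

    /* Build the result in a temporary, because d is also a source. */
    for (i = 0; i < width; i++) {
        r[i] = (s->b[i] & 0x80) ? 0 : d->b[s->b[i] & (width - 1)];
    }

    /* Write back only the bytes that were computed; the rest of d
     * is left untouched. */
    memcpy(d->b, r, width);
}

/* Returns 1 if the shuffle result is correct and bytes 16..63 are preserved. */
int fixed_helper_preserves_upper_bytes(void)
{
    ModelReg d, s;
    size_t i;

    memset(&d, 0x55, sizeof(d));             /* recognisable upper contents */
    memset(&s, 0, sizeof(s));
    for (i = 0; i < 16; i++) {
        d.b[i] = (uint8_t)i;                 /* low bytes: 0, 1, ..., 15 */
        s.b[i] = (uint8_t)(15 - i);          /* mask: reverse the bytes */
    }
    s.b[0] = 0x80;                           /* high bit set: lane 0 becomes 0 */

    model_pshufb_fixed(&d, &s, 16);

    if (d.b[0] != 0) return 0;
    for (i = 1; i < 16; i++) {
        if (d.b[i] != 15 - i) return 0;
    }
    for (i = 16; i < 64; i++) {
        if (d.b[i] != 0x55) return 0;
    }
    return 1;
}
```

Whether a partial write-back like this is cheaper than zero-initialising the local or generating one helper per register size would need to be measured; the sketch only shows that the destination's upper bytes can be left intact.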

sse_test.c

Edited by Stevie Lavern