SSE/AVX use fmaddsub for complex products

Reference issue

What does this implement/fix?

Interestingly, clang does not automatically fuse vmulps and vaddsubps into a single vfmaddsub instruction, even with -ffast-math. This MR micro-optimizes the SSE/AVX complex multiplication kernels by calling fmaddsub explicitly, which removes one instruction from the generated code. The AVX512 kernels already use this approach.

Generated code with Clang (x86-64 trunk), -O3 -DNDEBUG -mavx2 -mfma:

Old          New
vmovsldup    vmovshdup
vmulps       vshufps
vmovshdup    vmulps
vshufps      vmovsldup
vmulps       vfmaddsub213ps
vaddsubps
ret          ret
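
For readers who want to see the two instruction sequences at the intrinsics level, here is a minimal sketch of both variants for four interleaved complex floats. It is not the code from this MR: the function names (cmul4_addsub, cmul4_fmaddsub, cpu_has_avx_fma) are made up for illustration, and the target attribute and __builtin_cpu_supports are GCC/Clang-specific assumptions so that the snippet builds without -mavx -mfma.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX and FMA for the annotated functions only.
#define CMUL_TARGET __attribute__((target("avx,fma")))

// Hypothetical helper: true if the running CPU supports AVX and FMA.
inline bool cpu_has_avx_fma() {
  return __builtin_cpu_supports("avx") && __builtin_cpu_supports("fma");
}

// "Old" sequence: two multiplies followed by addsub.
// a, b, out each hold 4 complex floats laid out as [re0, im0, re1, im1, ...].
CMUL_TARGET inline void cmul4_addsub(const float* a, const float* b, float* out) {
  __m256 va = _mm256_loadu_ps(a);
  __m256 vb = _mm256_loadu_ps(b);
  __m256 a_re = _mm256_moveldup_ps(va);                          // [ar, ar, ...]
  __m256 a_im = _mm256_movehdup_ps(va);                          // [ai, ai, ...]
  __m256 b_sw = _mm256_permute_ps(vb, _MM_SHUFFLE(2, 3, 0, 1));  // [bi, br, ...]
  __m256 t1 = _mm256_mul_ps(a_re, vb);                           // [ar*br, ar*bi, ...]
  __m256 t2 = _mm256_mul_ps(a_im, b_sw);                         // [ai*bi, ai*br, ...]
  // addsub subtracts in even lanes and adds in odd lanes:
  // [ar*br - ai*bi, ar*bi + ai*br, ...]
  _mm256_storeu_ps(out, _mm256_addsub_ps(t1, t2));
}

// "New" sequence: the first multiply and the addsub are fused into fmaddsub.
CMUL_TARGET inline void cmul4_fmaddsub(const float* a, const float* b, float* out) {
  __m256 va = _mm256_loadu_ps(a);
  __m256 vb = _mm256_loadu_ps(b);
  __m256 a_re = _mm256_moveldup_ps(va);
  __m256 a_im = _mm256_movehdup_ps(va);
  __m256 b_sw = _mm256_permute_ps(vb, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 t2 = _mm256_mul_ps(a_im, b_sw);                         // [ai*bi, ai*br, ...]
  // fmaddsub computes a_re*vb - t2 in even lanes and a_re*vb + t2 in odd lanes.
  _mm256_storeu_ps(out, _mm256_fmaddsub_ps(a_re, vb, t2));
}
```

Callers should check cpu_has_avx_fma() before using either function. Because the fused variant rounds once instead of twice in each lane, the two functions can differ in the last bit for inputs whose products are not exactly representable.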

There may be some creative approaches to implementing pmadd, but they were not apparent to me.

Additional information
