SSE/AVX: use fmaddsub for complex products
Reference issue
What does this implement/fix?
Interestingly, clang does not automatically fuse a vmulps followed by a vaddsubps into a single vfmaddsub, even with -ffast-math. This change micro-optimizes the SSE/AVX complex multiplication kernels by using fmaddsub explicitly, which removes one instruction per complex product. The AVX512 kernels already use this approach.
Generated assembly with Clang (x86-64 trunk), -O3 -DNDEBUG -mavx2 -mfma:
| Old | New |
|---|---|
| vmovsldup | vmovshdup |
| vmulps | vshufps |
| vmovshdup | vmulps |
| vshufps | vmovsldup |
| vmulps | vfmaddsub213ps |
| vaddsubps | |
| ret | ret |
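
To make the "New" column easier to follow, here is a minimal scalar sketch of what each instruction does to the lanes of two interleaved complex floats. It is not the Eigen kernel: the `Lanes` type and the helper names (`dup_even`, `dup_odd`, `swap_pairs`, `fmaddsub`, `complex_mul`) are hypothetical stand-ins for vmovsldup, vmovshdup, vshufps and vfmaddsub, written in plain C++ so it runs without FMA hardware.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Two interleaved complex floats: {re0, im0, re1, im1}.
// Hypothetical model type, not an Eigen packet.
using Lanes = std::array<float, 4>;

// Models vmovsldup: duplicate the even (real) lanes.
Lanes dup_even(const Lanes& v) { return {v[0], v[0], v[2], v[2]}; }

// Models vmovshdup: duplicate the odd (imaginary) lanes.
Lanes dup_odd(const Lanes& v) { return {v[1], v[1], v[3], v[3]}; }

// Models the vshufps that swaps real and imaginary parts within each pair.
Lanes swap_pairs(const Lanes& v) { return {v[1], v[0], v[3], v[2]}; }

// Models vmulps: lane-wise product.
Lanes mul(const Lanes& x, const Lanes& y) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = x[i] * y[i];
  return r;
}

// Models vfmaddsub: even lanes compute x*y - z, odd lanes compute x*y + z,
// each with a single rounding (std::fma).
Lanes fmaddsub(const Lanes& x, const Lanes& y, const Lanes& z) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    const float addend = (i % 2 == 0) ? -z[i] : z[i];
    r[i] = std::fma(x[i], y[i], addend);
  }
  return r;
}

// Complex product a * b following the order of the "New" column:
// the imaginary-part product is formed first, then fused into the final step.
Lanes complex_mul(const Lanes& a, const Lanes& b) {
  const Lanes t = mul(dup_odd(a), swap_pairs(b));  // {a.im*b.im, a.im*b.re, ...}
  return fmaddsub(dup_even(a), b, t);              // {a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re, ...}
}
```

The even lanes end up with `a.re*b.re - a.im*b.im` and the odd lanes with `a.re*b.im + a.im*b.re`, which is the complex product. The old sequence computed the second vmulps and the vaddsubps as two separate instructions; here the fused step replaces both.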
There may be some creative approaches to implementing pmadd, but they were not apparent to me.