SSE/AVX use fmaddsub for complex products
Reference issue
What does this implement/fix?
Interestingly, clang does not automagically fuse `vmulps` and `vaddsubps` into `vfmaddsub` (even with `-ffast-math`). This micro-optimizes the SSE/AVX complex multiplication kernels by emitting the fused instruction explicitly, as the AVX512 kernels already do.
Clang (x86-64, trunk) with `-O3 -DNDEBUG -mavx2 -mfma`:
Old | New |
---|---|
vmovsldup | vmovshdup |
vmulps | vshufps |
vmovshdup | vmulps |
vshufps | vmovsldup |
vmulps | vfmaddsub213ps |
vaddsubps | |
ret | ret |
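
For readers who want to map the "New" column onto intrinsics, here is a minimal sketch of the same sequence. It is not the code from this change: `cmul_fmaddsub` is a hypothetical helper name, the restriction to multiples of four elements is an assumption made to keep the loop short, and the `#else` branch is only a portable scalar model of the lane arithmetic for builds without `-mavx -mfma`.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical helper (not Eigen's API): out[i] = a[i] * b[i].
// Assumes n is a multiple of 4, i.e. whole 256-bit registers of
// interleaved (re, im) float pairs.
inline void cmul_fmaddsub(const std::complex<float>* a,
                          const std::complex<float>* b,
                          std::complex<float>* out, std::size_t n) {
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4) {
#if defined(__AVX__) && defined(__FMA__)
    __m256 va = _mm256_loadu_ps(reinterpret_cast<const float*>(a + i));
    __m256 vb = _mm256_loadu_ps(reinterpret_cast<const float*>(b + i));
    __m256 a_im  = _mm256_movehdup_ps(va);          // (ai, ai, ...)
    __m256 b_sw  = _mm256_shuffle_ps(vb, vb, 0xB1); // (bi, br, ...)
    __m256 cross = _mm256_mul_ps(a_im, b_sw);       // (ai*bi, ai*br, ...)
    __m256 a_re  = _mm256_moveldup_ps(va);          // (ar, ar, ...)
    // Even lanes: ar*br - ai*bi, odd lanes: ar*bi + ai*br.
    __m256 r = _mm256_fmaddsub_ps(a_re, vb, cross);
    _mm256_storeu_ps(reinterpret_cast<float*>(out + i), r);
#else
    // Scalar model of the same per-lane arithmetic (one fused op per lane).
    for (std::size_t k = i; k < i + 4; ++k) {
      const float ar = a[k].real(), ai = a[k].imag();
      const float br = b[k].real(), bi = b[k].imag();
      out[k] = std::complex<float>(std::fma(ar, br, -(ai * bi)),
                                   std::fma(ar, bi, ai * br));
    }
#endif
  }
}
```

The old sequence would instead compute `_mm256_mul_ps(a_re, vb)` separately and combine the two products with `_mm256_addsub_ps`; the fused form folds that multiply into the final add/sub step.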
There may be some creative approaches to implementing `pmadd`, but they were not apparent to me.