SSE/AVX Complex FMA
Reference issue
What does this implement/fix?
Adds SSE and AVX implementations of complex fused multiply-add and the related operations. These emit fewer instructions than composing pmadd(a,b,c) as padd(pmul(a,b),c) and are slightly more accurate. As a follow-up, we should look into defining a generic packet op that streamlines pmul(pconj(a), b) and its FMA analogues, since the conjugation comes essentially for free when the right intrinsics are issued in the right order. These could be called pcmul(a,b) and pcmadd(a,b,c), and could squeeze a bit more performance out of dot products and similar kernels.
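To illustrate why the conjugation can be free, here is a minimal scalar sketch of the arithmetic. It is not the packet code from this change: the function names `cmadd_ref` and `cmadd_conj_ref` are hypothetical, and `std::fma` stands in for the hardware FMA intrinsics. The two variants differ only in the signs of the `a.imag()` terms, so a conjugating version needs the same number of fused operations and no separate conjugation step.

```cpp
#include <cmath>
#include <complex>

// Hypothetical scalar reference for a * b + c on complex<float>.
// (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
std::complex<float> cmadd_ref(std::complex<float> a, std::complex<float> b,
                              std::complex<float> c) {
  // Two chained FMAs per component; the product is never rounded on its own.
  float re = std::fma(a.real(), b.real(), std::fma(-a.imag(), b.imag(), c.real()));
  float im = std::fma(a.real(), b.imag(), std::fma(a.imag(), b.real(), c.imag()));
  return {re, im};
}

// Hypothetical scalar reference for conj(a) * b + c.
// (ar - i*ai)(br + i*bi) = (ar*br + ai*bi) + i*(ar*bi - ai*br)
std::complex<float> cmadd_conj_ref(std::complex<float> a, std::complex<float> b,
                                   std::complex<float> c) {
  // Same shape as cmadd_ref; only the signs on the a.imag() terms are swapped.
  float re = std::fma(a.real(), b.real(), std::fma(a.imag(), b.imag(), c.real()));
  float im = std::fma(a.real(), b.imag(), std::fma(-a.imag(), b.real(), c.imag()));
  return {re, im};
}
```

In a vectorized version the sign swap would presumably map to choosing between the add, subtract, and alternating add/subtract forms of the FMA instructions, which is the idea behind the proposed pcmul and pcmadd.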
Additional information
Edited by Charles Schlosser