Remove inline assembly for FMA (AVX) and add remaining extensions as packet ops: pmsub, pnmadd, and pnmsub.
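For readers unfamiliar with the naming, the sign conventions of the four fused ops can be sketched with scalar stand-ins. This is an illustration only, not the Eigen code: the `fma_sketch` namespace and the `double` signatures are assumptions made here, and the real ops are templates over packet types.

```cpp
#include <cassert>
#include <cmath>

namespace fma_sketch {  // hypothetical namespace, not part of Eigen

// Scalar stand-ins for the packet ops, built on std::fma (single rounding).
// The x86 FMA intrinsics with matching semantics for 8 floats are
// _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps and _mm256_fnmsub_ps.
inline double pmadd(double a, double b, double c)  { return std::fma(a, b, c); }    //   a*b + c
inline double pmsub(double a, double b, double c)  { return std::fma(a, b, -c); }   //   a*b - c
inline double pnmadd(double a, double b, double c) { return std::fma(-a, b, c); }   // -(a*b) + c
inline double pnmsub(double a, double b, double c) { return std::fma(-a, b, -c); }  // -(a*b) - c

// Small self-check using values that are exactly representable.
inline void self_check() {
  assert(pmadd(2.0, 3.0, 1.0) == 7.0);
  assert(pmsub(2.0, 3.0, 1.0) == 5.0);
  assert(pnmadd(2.0, 3.0, 1.0) == -5.0);
  assert(pnmsub(2.0, 3.0, 1.0) == -7.0);
}

}  // namespace fma_sketch
```

Each variant folds the negation into the fused operation, so a caller that needs `c - a*b` can write `pnmadd(a, b, c)` instead of negating a product separately.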
Adding these additional variants can save explicit negations in various low-level implementations. In a follow-up to this change, they will be used to make preciprocal IEEE-compliant with minimal overhead.
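As a rough illustration of where a negated fused op saves work, here is a scalar sketch of one Newton-Raphson refinement step for a reciprocal. It is not the follow-up implementation: the `recip_sketch` namespace, the function names, and the starting guess are all hypothetical, and the sketch makes no claim about IEEE compliance.

```cpp
#include <cassert>
#include <cmath>

namespace recip_sketch {  // hypothetical namespace, not part of Eigen

// Scalar stand-ins for the packet ops (assumed semantics: a*b + c and -(a*b) + c).
inline double pmadd(double a, double b, double c)  { return std::fma(a, b, c); }
inline double pnmadd(double a, double b, double c) { return std::fma(-a, b, c); }

// One Newton-Raphson step for x ~ 1/a:  x_new = x + x * (1 - a*x).
inline double refine_reciprocal(double a, double x) {
  // Residual 1 - a*x in a single fused op; without pnmadd this would need
  // an explicit negation of a or of the product.
  const double r = pnmadd(a, x, 1.0);
  return pmadd(x, r, x);
}

// Apply a fixed number of refinement steps to an initial guess x0.
inline double reciprocal(double a, double x0, int steps) {
  double x = x0;
  for (int i = 0; i < steps; ++i) x = refine_reciprocal(a, x);
  return x;
}

}  // namespace recip_sketch
```

For example, starting from the guess `0.2` for `1/4`, one step gives roughly `0.24`, and the error shrinks quadratically with each further step.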
This change also removes the old workaround for register spilling in Eigen/src/Core/arch/AVX/PacketMath.h, which appears to be counterproductive on modern compiler/CPU combinations. For example, compiling a matrix multiplication benchmark with clang 11 without the workaround yields the following speedups on a Skylake core (in addition to the improved readability):
| flags | speedup |
|---|---|
| -march=skylake | 25% (!) |
| -mavx -mfma | 12% (!) |
| -mavx | unchanged |
Closes #2231
Edited by Rasmus Munk Larsen