Remove inline assembly for FMA (AVX) and add remaining extensions as packet ops: pmsub, pnmadd, and pnmsub.

Adding these additional variants can save explicit negations in various low-level implementations. In a follow-up to this change, they will be used to make preciprocal IEEE-compliant with minimal overhead.
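For reference, a minimal scalar sketch of what the four variants compute, assuming the usual x86 FMA sign conventions (fmadd, fmsub, fnmadd, fnmsub). The `*_ref` names and `refine_reciprocal_ref` are hypothetical helpers written for illustration; they are not the packet ops added here and not the follow-up preciprocal implementation.

```cpp
#include <cmath>

// Hypothetical scalar reference versions (illustrative names, not Eigen's API).
// Each one is a single fused multiply-add with the signs folded into the operands.
inline double pmadd_ref(double a, double b, double c)  { return std::fma(a, b, c); }    //  a*b + c
inline double pmsub_ref(double a, double b, double c)  { return std::fma(a, b, -c); }   //  a*b - c
inline double pnmadd_ref(double a, double b, double c) { return std::fma(-a, b, c); }   // -(a*b) + c
inline double pnmsub_ref(double a, double b, double c) { return std::fma(-a, b, -c); }  // -(a*b) - c

// One Newton-Raphson refinement step for x ~ 1/a (assumed example use).
// The residual 1 - a*x is exactly the "negated multiply, then add" form,
// so no separate negation is needed.
inline double refine_reciprocal_ref(double a, double x) {
  const double residual = pnmadd_ref(a, x, 1.0);  // 1 - a*x
  return pmadd_ref(x, residual, x);               // x + x*(1 - a*x)
}
```

With a hardware FMA, each variant maps to one instruction, whereas composing them from pmadd and pnegate would cost an extra operation per call.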

This change also removes the old workaround for register spilling in Eigen/src/Core/arch/AVX/PacketMath.h, which appears to be counterproductive on modern compiler/CPU combinations. For example, compiling a matrix multiplication benchmark with clang 11 without the workaround yields the following speedups on a Skylake core, in addition to improving readability:

| flags | speedup |
| --- | --- |
| -march=skylake | 25% (!) |
| -mavx -mfma | 12% (!) |
| -mavx | unchanged |

Closes #2231

Edited by Rasmus Munk Larsen
