Replace calls to numext::fma with numext::madd.
The function numext::fma should be reserved for when we actually
need the extended precision. In cases where the extra precision
is not necessary, madd will try to do the "best" thing:
- Use FMA if there is a CPU instruction for it (i.e. EIGEN_VECTORIZE_FMA)
- Otherwise, fall back to x * y + z
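
The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not Eigen's actual implementation: the name madd_sketch and the macro SKETCH_HAS_HW_FMA are hypothetical, and the compiler-defined macros __FMA__ and __ARM_FEATURE_FMA are used here only as a stand-in for the EIGEN_VECTORIZE_FMA check.

```cpp
#include <cmath>

// Hypothetical stand-in for "a hardware FMA instruction is available".
// Eigen's own check (EIGEN_VECTORIZE_FMA) may be defined differently.
#if defined(__FMA__) || defined(__ARM_FEATURE_FMA)
#define SKETCH_HAS_HW_FMA 1
#else
#define SKETCH_HAS_HW_FMA 0
#endif

// Illustrative multiply-add: fused when the hardware supports it,
// plain multiply followed by add otherwise.
template <typename T>
inline T madd_sketch(T x, T y, T z) {
#if SKETCH_HAS_HW_FMA
  // Single rounding, and cheap because it maps to one instruction.
  return std::fma(x, y, z);
#else
  // Avoids the slow software-emulated fma; the product may be rounded
  // before the addition.
  return x * y + z;
#endif
}
```

With this shape, callers that do not depend on the single rounding never pay for an emulated FMA, while code that does need it can keep calling the fused function directly.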
This helps prevent excessive slowdowns. For example, with Emscripten/WASM, the software-emulated FMA implementation is about 30x slower than a basic multiply-add. On Intel/AMD CPUs, the emulated FMA seems to be 3-5x slower than multiply-add. If FMA CPU instructions are available, then FMA seems to be on par with multiply-add performance-wise, so we get the extra precision for free.
Fixes #2959.