std::fma is very slow in WASM

Summary

I've been testing the mainline of Eigen in wasm against the released 3.4. Initially mainline was much slower than 3.4 for our application, and I think I've traced it to the use of std::fma. In wasm, std::fma appears to be much slower than computing a*b+c directly (roughly 8x to 20x in the timings below), while natively it is only slightly slower. Is there an option, or could we add one, to tell Eigen to turn std::fma into a basic a*b+c?

SIMD does not seem to make a difference. I've tested on two machines, one with an AMD Ryzen Threadripper PRO 7975WX and one M2 MacBook. The machine the code runs on matters more than where it was compiled: if I run the wasm compiled on the Ryzen machine on the M2, I get similar results. The node version also makes a difference. The main results below were with node 22.16.0. With node 24.6.0 results improve a little (tested on the M2), but there is still a big slowdown compared to a basic a*b+c.
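To make the request concrete, here is a rough sketch of what such an opt-out could look like. The macro name MY_NO_STD_FMA and the helper my_madd are hypothetical and are not existing Eigen names; the real change would have to live wherever Eigen calls std::fma.

```cpp
#include <cmath>

// Hypothetical opt-out switch; this is NOT an existing Eigen macro.
// Build with -DMY_NO_STD_FMA to avoid std::fma on targets where it is slow.

// Hypothetical helper: multiply-add that uses std::fma only when allowed.
inline double my_madd(double a, double b, double c)
{
#ifdef MY_NO_STD_FMA
    return a * b + c;          // product is rounded, then the sum is rounded
#else
    return std::fma(a, b, c);  // single rounding of the exact a*b+c
#endif
}
```

The two branches can differ in the last bit of the result, so a switch like this would trade a little accuracy for speed and should probably be opt-in.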

Environment

  • Operating System : Windows/Linux/Mac/Web
  • Architecture : x64/Arm64/WASM ...
  • Eigen Version : main
  • Compiler Version : Emscripten/Gcc/Clang
  • Compile Flags : -O3
  • Vector Extension : N/A

Minimal Example

#include <cmath>
#include <iostream>
#include <chrono>
#include <vector>
#include <random>

int main(int argc, char** argv)
{
    int num_calls = 1000000;

    std::mt19937 gen(1);
    std::uniform_real_distribution<> dis(-1.0, 1.0);

    std::vector<double> vals(num_calls);
    for (int i = 0; i < num_calls; ++i)
    {
        vals[i] = dis(gen);
    }

    double result = 0;
    auto start = std::chrono::system_clock::now();
    for (int j = 0; j < 10; ++j)
    {
        for (int i = 0; i < num_calls; ++i)
        {
            result += std::fma(vals[i], vals[i], vals[i]);
        }
    }
    auto end = std::chrono::system_clock::now();
    auto fma_time = std::chrono::duration<double>(end - start).count();

    std::cout << "using std::fma: " << fma_time << "s  " << result << std::endl;

    result = 0;
    start = std::chrono::system_clock::now();
    for (int j = 0; j < 10; ++j)
    {
        for (int i = 0; i < num_calls; ++i)
        {
            result += vals[i] * vals[i] + vals[i];
        }
    }
    end = std::chrono::system_clock::now();
    auto basic_time = std::chrono::duration<double>(end - start).count();

    std::cout << "using basic: " << basic_time << "s  " << result << std::endl;
    std::cout << "slowdown factor: " << fma_time / basic_time << std::endl;
}

OPT_FLAGS=-O3

fma_test: fma_test.cpp
	g++ ${OPT_FLAGS} fma_test.cpp -o fma_test

fma_test.cjs: fma_test.cpp
	em++ ${OPT_FLAGS} fma_test.cpp -o fma_test.cjs

fma_test_simd.cjs: fma_test.cpp
	em++ ${OPT_FLAGS} -msimd128 -msse4.2 fma_test.cpp -o fma_test_simd.cjs

all: fma_test fma_test.cjs fma_test_simd.cjs

run: all
	@echo "Native:"
	@./fma_test ;
	@echo ""
	@echo "WASM no simd:"
	@node ./fma_test.cjs ;
	@echo ""
	@echo "WASM simd:"
	@node ./fma_test_simd.cjs ;

clean:
	rm fma_test fma_test.cjs fma_test.wasm fma_test_simd.cjs fma_test_simd.wasm

Steps to reproduce

  1. Save the timing code above as fma_test.cpp next to the Makefile and run: make run
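Timings this short (about 10 to 200 ms) can be noisy. One optional way to steady them when reproducing is to time with std::chrono::steady_clock and keep the best of several runs. The helper best_time_seconds below is a hypothetical addition for illustration and is not part of the timing code above.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper (not in the original test): calls `work` `repeats`
// times and returns the fastest wall-clock time in seconds.
template <typename Work>
double best_time_seconds(Work&& work, int repeats = 5)
{
    using clock = std::chrono::steady_clock;  // monotonic, unlike system_clock
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r)
    {
        const auto start = clock::now();
        work();
        const auto end = clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start).count());
    }
    return best;
}
```

Each of the two loops in the test could be wrapped in a lambda and passed to this helper, with the slowdown factor computed from the two best times.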

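What is the current bug behavior?

In wasm, std::fma is roughly 8x to 20x slower than a*b+c in the timings below, compared with about 1.2x to 1.6x natively.

What is the expected correct behavior?

The slowdown of std::fma in wasm should be comparable to native, or Eigen should offer a way to use a basic a*b+c instead.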

Relevant logs

Results with node 22.16.0.

On AMD Ryzen Threadripper PRO 7975WX:
Native:
using std::fma: 0.0242092s  3.34671e+06
using basic: 0.0148272s  3.34671e+06
slowdown factor: 1.63275

WASM no simd:
using std::fma: 0.198s  3.34671e+06
using basic: 0.010001s  3.34671e+06
slowdown factor: 19.798

WASM simd:
using std::fma: 0.202999s  3.34671e+06
using basic: 0.010001s  3.34671e+06
slowdown factor: 20.2979


On M2 Macbook:
Native:
using std::fma: 0.015125s  3.34671e+06
using basic: 0.012715s  3.34671e+06
slowdown factor: 1.18954

WASM no simd:
using std::fma: 0.154999s  3.34671e+06
using basic: 0.013999s  3.34671e+06
slowdown factor: 11.0721

WASM simd:
using std::fma: 0.155s  3.34671e+06
using basic: 0.015s  3.34671e+06
slowdown factor: 10.3333

Results with node 24.6.0 on M2 macbook

Native:
using std::fma: 0.016057s  3.34671e+06
using basic: 0.012931s  3.34671e+06
slowdown factor: 1.24174

WASM no simd:
using std::fma: 0.115001s  3.34671e+06
using basic: 0.015001s  3.34671e+06
slowdown factor: 7.66622

WASM simd:
using std::fma: 0.115s  3.34671e+06
using basic: 0.013999s  3.34671e+06
slowdown factor: 8.21487

Anything else that might help

  • Should this be filed as a bug in emscripten?
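One point that may bear on where to file this: std::fma is specified to round only once, so a toolchain cannot silently replace it with a*b+c without changing results. A possible explanation for the slowdown, stated as an assumption rather than something verified here, is that core WebAssembly has no scalar fused multiply-add instruction, so std::fma would fall back to a software routine. The sketch below only illustrates the rounding difference; the helper names are hypothetical.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only.

// Fused: the exact value of a*b+c is rounded once.
inline double fused_madd(double a, double b, double c)
{
    return std::fma(a, b, c);
}

// Unfused: the product is rounded to double before the addition.
// The volatile store keeps the compiler from contracting this into an FMA.
inline double unfused_madd(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

// Inputs whose exact product, 1 - 2^-54, is not representable as a double.
inline double example_a() { return 1.0 + std::ldexp(1.0, -27); }
inline double example_b() { return 1.0 - std::ldexp(1.0, -27); }
```

With a = example_a(), b = example_b() and c = -1.0, the fused form returns -2^-54 while the unfused form returns 0, because the product rounds to exactly 1.0 before the addition.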
Edited by Timothy Langlois