std::fma is very slow in WASM
Summary
I've been testing Eigen mainline in WASM against the released 3.4. Initially, mainline was much slower than 3.4 for our application, and I think I've traced it to the use of std::fma. In WASM, std::fma seems to be much slower than computing a*b+c directly (roughly 8x to 20x in the timing test below), whereas natively it is only slightly slower (roughly 1.2x to 1.6x). Is there an option, or could we add one, to tell Eigen to turn std::fma into a plain a*b+c?

SIMD does not seem to make a difference. I've tested on two machines: one with an AMD Ryzen Threadripper PRO 7975WX, and an M2 MacBook. The machine the code runs on matters more than the machine it was compiled on: if I take the WASM compiled on the Ryzen machine and run it on the M2, I get results similar to the other M2 runs. The node version also makes a difference. The main results below are with node 22.16.0. With node 24.6.0, results improve a little (tested on the M2), but std::fma is still much slower than a plain a*b+c.
Environment
- Operating System : Windows/Linux/Mac/Web
- Architecture : x64/Arm64/WASM ...
- Eigen Version : main
- Compiler Version : Emscripten/Gcc/Clang
- Compile Flags : -O3
- Vector Extension : N/A
Minimal Example
#include <cmath>
#include <iostream>
#include <chrono>
#include <vector>
#include <random>

int main(int argc, char** argv)
{
    int num_calls = 1000000;

    // Fill the input with reproducible random values in [-1, 1).
    std::mt19937 gen(1);
    std::uniform_real_distribution<> dis(-1.0, 1.0);
    std::vector<double> vals(num_calls);
    for (int i = 0; i < num_calls; ++i)
    {
        vals[i] = dis(gen);
    }

    // Time 10 passes over the data using std::fma.
    double result = 0;
    auto start = std::chrono::system_clock::now();
    for (int j = 0; j < 10; ++j)
    {
        for (int i = 0; i < num_calls; ++i)
        {
            result += std::fma(vals[i], vals[i], vals[i]);
        }
    }
    auto end = std::chrono::system_clock::now();
    auto fma_time = std::chrono::duration<double>(end - start).count();
    // Print the result so the loop cannot be optimized away.
    std::cout << "using std::fma: " << fma_time << "s " << result << std::endl;

    // Time the same 10 passes using a plain multiply and add.
    result = 0;
    start = std::chrono::system_clock::now();
    for (int j = 0; j < 10; ++j)
    {
        for (int i = 0; i < num_calls; ++i)
        {
            result += vals[i] * vals[i] + vals[i];
        }
    }
    end = std::chrono::system_clock::now();
    auto basic_time = std::chrono::duration<double>(end - start).count();
    std::cout << "using basic: " << basic_time << "s " << result << std::endl;

    std::cout << "slowdown factor: " << fma_time / basic_time << std::endl;
}
# Makefile: builds the native and WASM variants of fma_test.cpp and runs all three.
OPT_FLAGS=-O3

fma_test: fma_test.cpp
	g++ ${OPT_FLAGS} fma_test.cpp -o fma_test

fma_test.cjs: fma_test.cpp
	em++ ${OPT_FLAGS} fma_test.cpp -o fma_test.cjs

fma_test_simd.cjs: fma_test.cpp
	em++ ${OPT_FLAGS} -msimd128 -msse4.2 fma_test.cpp -o fma_test_simd.cjs

all: fma_test fma_test.cjs fma_test_simd.cjs

run: all
	@echo "Native:"
	@./fma_test ;
	@echo ""
	@echo "WASM no simd:"
	@node ./fma_test.cjs ;
	@echo ""
	@echo "WASM simd:"
	@node ./fma_test_simd.cjs ;

clean:
	rm fma_test fma_test.cjs fma_test.wasm fma_test_simd.cjs fma_test_simd.wasm
Steps to reproduce
- Run attached timing code
What is the current bug behavior?
What is the expected correct behavior?
Relevant logs
Results with node 22.16.0.
On AMD Ryzen Threadripper PRO 7975WX:
Native:
using std::fma: 0.0242092s 3.34671e+06
using basic: 0.0148272s 3.34671e+06
slowdown factor: 1.63275
WASM no simd:
using std::fma: 0.198s 3.34671e+06
using basic: 0.010001s 3.34671e+06
slowdown factor: 19.798
WASM simd:
using std::fma: 0.202999s 3.34671e+06
using basic: 0.010001s 3.34671e+06
slowdown factor: 20.2979
On M2 MacBook:
Native:
using std::fma: 0.015125s 3.34671e+06
using basic: 0.012715s 3.34671e+06
slowdown factor: 1.18954
WASM no simd:
using std::fma: 0.154999s 3.34671e+06
using basic: 0.013999s 3.34671e+06
slowdown factor: 11.0721
WASM simd:
using std::fma: 0.155s 3.34671e+06
using basic: 0.015s 3.34671e+06
slowdown factor: 10.3333
Results with node 24.6.0 on M2 MacBook:
Native:
using std::fma: 0.016057s 3.34671e+06
using basic: 0.012931s 3.34671e+06
slowdown factor: 1.24174
WASM no simd:
using std::fma: 0.115001s 3.34671e+06
using basic: 0.015001s 3.34671e+06
slowdown factor: 7.66622
WASM simd:
using std::fma: 0.115s 3.34671e+06
using basic: 0.013999s 3.34671e+06
slowdown factor: 8.21487
Anything else that might help
- Should this be filed as a bug in emscripten?