Partial vectorization leads to slow(er) code (Regression from Eigen 3.2)
Summary
The attached example runs roughly 3x slower with Eigen 3.3 (and Eigen 3.4-rc1) than with the 3.2 branch. The example mainly accesses individual elements and performs computations with them, plus a single squaredNorm of a 3-vector segment.
Environment
- Operating System : macOS 11.2.3 Big Sur (but similar behavior has been observed on Linux)
- Architecture : x64
- Eigen Version : 3.3.X, 3.4-rc1
- Compiler Version : clang 11.1.0, gcc 10.2.0
- Compile Flags : -std=c++17 -Ofast -march=native -fno-finite-math-only -fno-stack-protector
- Vector Extension : AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI
Minimal Example
Steps to reproduce
- Compile the included benchmark, e.g.
clang++ -std=c++17 -Ofast -march=native -fno-finite-math-only -fno-stack-protector eigen_bench_augment.cpp -I EIGEN_INC
- Run the binary, which will print some timing information
- Repeat the above with Eigen 3.2 and contrast with Eigen 3.3 (or 3.4-rc1).
- Try 3.3 or 3.4 without partial vectorization (-DEIGEN_UNALIGNED_VECTORIZE=0).
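For anyone who wants to compare the builds from the steps above with their own kernel, here is a minimal timing sketch. It is not the attached benchmark: `mean_ns_per_call` and `squared_norm3` are hypothetical names, and `squared_norm3` is a plain C++ stand-in for the segment squaredNorm call so that the snippet compiles without Eigen.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the attached benchmark): calls `kernel(i)`
// for i in [0, iterations) and returns the mean wall-clock time per call
// in nanoseconds. Returns 0.0 when iterations is 0.
template <typename Kernel>
double mean_ns_per_call(Kernel&& kernel, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    if (iterations == 0) {
        return 0.0;
    }
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        kernel(i);
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Plain C++ stand-in for a squared norm over three consecutive values.
// In a real comparison the kernel would call into Eigen instead.
inline double squared_norm3(const double* p) {
    return p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
}
```

To use it, wrap the benchmark body in a lambda that writes its result to a `volatile` variable (so the compiler cannot drop the work), then build the same source once per Eigen version and once with -DEIGEN_UNALIGNED_VECTORIZE=0 and compare the reported times.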
What is the current bug behavior?
Performance is slower than expected with Eigen 3.3 and 3.4-rc1. On my system, 3.3 and 3.4-rc1 with partial vectorization enabled (which is the default) are roughly 3x slower than Eigen 3.2, and roughly 3x slower than the same versions with partial vectorization disabled.
What is the expected correct behavior?
Partial vectorization should only be used in cases where it is profitable. Disabling it should not lead to a speed-up.
Anything else that might help
To me, the generated code doesn't look 3x slower, so I'm not sure what's going on. I see two more uses of FMA in the Eigen 3.2 assembly but that also shouldn't lead to such a large difference.
As mentioned before, Eigen 3.2 (which didn't support partial vectorization) didn't have this issue.