optimize setConstant, setZero
Reference issue
What does this implement/fix?
Currently, we swap out our usual assignment logic in `setConstant` for `std::fill_n` if the type is a `PlainObject`, i.e. a `Matrix` or `Array`. We should do the same for eligible blocks, or any type whose data is a contiguous array. The eligible types are (a rough sketch of the idea follows the list):
- `Array`
- `Matrix`
- Blocks with `InnerPanel == true` (`leftCols`, `middleCols`, `rightCols` for column-major matrices; likewise for row-major)
- Blocks representing a vector that matches the source expression's storage order (`head`, `segment`, `tail`)
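Below is a minimal, standalone sketch of the dispatch idea, not the actual code path in this MR: `fill_constant` and the tag dispatch on contiguity are made-up names for illustration, while the real change hooks into Eigen's assignment machinery.

```cpp
#include <algorithm>
#include <type_traits>
#include <Eigen/Core>

// Hypothetical helper: when the destination exposes one contiguous run of
// coefficients, forward a constant fill to std::fill_n ...
template <typename Dst, typename Scalar>
void fill_constant(Dst& dst, const Scalar& value, std::true_type /*contiguous*/) {
  std::fill_n(dst.data(), dst.size(), value);
}

// ... otherwise fall back to plain element-wise assignment.
template <typename Dst, typename Scalar>
void fill_constant(Dst& dst, const Scalar& value, std::false_type /*contiguous*/) {
  for (Eigen::Index j = 0; j < dst.cols(); ++j)
    for (Eigen::Index i = 0; i < dst.rows(); ++i) dst(i, j) = value;
}

int main() {
  Eigen::MatrixXf m(4, 8);
  // leftCols() of a column-major matrix is an inner-panel block, so its
  // coefficients form one contiguous run and the fill_n path applies.
  auto left = m.leftCols(3);
  fill_constant(left, 1.0f, std::true_type{});
}
```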
The gains are substantial for sizes up to 8192 (with some fluke noise at 1024); beyond that the difference is marginal.
Benefit of enabling `fill_n` in `setConstant` for eligible blocks (clang++-18 -DNDEBUG -O3):
Size | Real Time | CPU Time |
---|---|---|
1 | -56.05% | -56.04% |
2 | -43.60% | -43.61% |
4 | -12.39% | -12.39% |
8 | -33.83% | -33.82% |
16 | -34.88% | -34.88% |
32 | -57.11% | -57.10% |
64 | -64.02% | -64.02% |
128 | -33.04% | -33.04% |
256 | -8.39% | -8.39% |
512 | -8.97% | -8.95% |
1024 | 8.22% | 8.22% |
2048 | -21.33% | -21.31% |
4096 | -6.15% | -6.14% |
8192 | -4.56% | -4.54% |
16384 | -0.06% | -0.06% |
32768 | -0.07% | -0.07% |
65536 | -0.02% | -0.02% |
131072 | -0.12% | -0.12% |
262144 | -0.13% | -0.13% |
524288 | 1.97% | 1.97% |
1048576 | -0.05% | -0.03% |
However, users on Discord reported that MSVC's `fill_n` is terrible and is slower than our assignment logic. The hot spot in `fill_n` for these users is a branch that determines whether the value is all zero bits; if so, `fill_n` switches to `memset`. I figure we should use `memset` directly, without branching, when the user calls `setZero()`.

The requirements to use `memset` are the same as above, with the additional requirement that the type is arithmetic (a built-in scalar, essentially). The gains are modest up to size 16, then almost nothing. My real intent was to provide MSVC users with a decent optimization for `setZero`, since `fill_n` is no good there.
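As a standalone illustration of that idea (not the patch itself; `zero_fill` is a made-up name, and the contiguity of the destination is simply assumed here): for built-in arithmetic scalars the value zero is all zero bits, so the whole run can be handed to `std::memset` with no runtime branch.

```cpp
#include <cstring>
#include <type_traits>
#include <Eigen/Core>

// Hypothetical helper, assuming dst.data() is one contiguous run
// (PlainObject, inner-panel block, or storage-order-matching vector block).
template <typename Dst>
void zero_fill(Dst& dst) {
  using Scalar = typename Dst::Scalar;
  if constexpr (std::is_arithmetic<Scalar>::value) {
    // Built-in scalars: zero is all zero bits, so memset is valid and
    // skips fill_n's "is the pattern all zero?" check entirely.
    std::memset(dst.data(), 0, sizeof(Scalar) * dst.size());
  } else {
    dst.setZero();  // generic path for non-arithmetic scalars
  }
}

int main() {
  Eigen::VectorXf v(1024);
  auto head = v.head(512);  // contiguous vector block
  zero_fill(head);
}
```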
Benefit of using `memset` in `setZero` (clang++-18 -DNDEBUG -O3):
Size | Real Time | CPU Time |
---|---|---|
1 | -11.10% | -11.10% |
2 | -12.59% | -12.57% |
4 | -14.37% | -14.37% |
8 | -10.96% | -10.96% |
16 | -10.95% | -10.93% |
32 | -0.01% | -0.01% |
64 | 0.01% | 0.04% |
128 | -0.89% | -0.89% |
256 | 0.12% | 0.14% |
512 | 0.25% | 0.25% |
1024 | -0.09% | -0.09% |
2048 | -1.02% | -1.02% |
4096 | -0.02% | 0.01% |
8192 | -0.16% | -0.16% |
16384 | -0.18% | -0.18% |
32768 | 0.03% | 0.06% |
65536 | -0.05% | -0.05% |
131072 | -0.01% | 0.01% |
262144 | 0.08% | 0.08% |
524288 | 5.03% | 5.05% |
1048576 | -0.41% | -0.41% |
Additional information
test:
```cpp
#include <benchmark/benchmark.h>
#include <Eigen/Core>

using namespace Eigen;

template <typename T>
static void set_constant(benchmark::State& state) {
  int n = state.range(0);
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  // head(n) yields a contiguous vector block, exercising the block path.
  auto v_map = v.head(n);
  v.setRandom();
  const T rando = internal::random<T>();
  for (auto s : state) {
    benchmark::DoNotOptimize(v_map.setConstant(rando));
  }
}

template <typename T>
static void set_zero(benchmark::State& state) {
  int n = state.range(0);
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  auto v_map = v.head(n);
  v.setRandom();
  for (auto s : state) {
    benchmark::DoNotOptimize(v_map.setZero());
  }
}

BENCHMARK(set_constant<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK(set_zero<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK_MAIN();
```
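For reference, building the snippet typically looks something like `clang++-18 -O3 -DNDEBUG bench.cpp -lbenchmark -lpthread` (the file name and library paths are placeholders; adjust for your setup).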