optimize setConstant, setZero

Reference issue

What does this implement/fix?

Currently, we swap out our usual assignment logic in setConstant for std::fill_n if the type is a PlainObject, i.e. Matrix or Array. We should do the same for eligible blocks, or for any type whose data is a contiguous array. The eligible types are listed below (with a short illustration after the list):

  • Array
  • Matrix
  • Blocks with InnerPanel == true (leftCols, middleCols, rightCols for column-major matrices, and likewise for row-major)
  • Blocks representing a vector that matches the source expression's storage order (head, segment, tail)
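
As a rough illustration (mine, not taken from the patch itself), here is which user-level calls would and would not qualify under these rules, assuming the default column-major storage:

```cpp
#include <Eigen/Core>

int main() {
  Eigen::MatrixXf m(8, 8);  // column-major by default

  m.setConstant(1.0f);                 // PlainObject: contiguous, eligible
  m.leftCols(3).setConstant(2.0f);     // InnerPanel == true: contiguous, eligible
  m.col(2).head(4).setConstant(3.0f);  // vector block along the storage order: eligible
  m.topRows(3).setConstant(4.0f);      // rows of a column-major matrix are strided,
                                       // so this keeps the usual assignment path
}
```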

The gains are substantial for sizes up to 8192 (with some fluke noise at 1024); beyond that, the difference is marginal.

Benefit of enabling fill_n in setConstant for eligible blocks (clang++-18 -DNDEBUG -O3)

| Size | Real Time | CPU Time |
|---|---|---|
| 1 | -56.05% | -56.04% |
| 2 | -43.60% | -43.61% |
| 4 | -12.39% | -12.39% |
| 8 | -33.83% | -33.82% |
| 16 | -34.88% | -34.88% |
| 32 | -57.11% | -57.10% |
| 64 | -64.02% | -64.02% |
| 128 | -33.04% | -33.04% |
| 256 | -8.39% | -8.39% |
| 512 | -8.97% | -8.95% |
| 1024 | 8.22% | 8.22% |
| 2048 | -21.33% | -21.31% |
| 4096 | -6.15% | -6.14% |
| 8192 | -4.56% | -4.54% |
| 16384 | -0.06% | -0.06% |
| 32768 | -0.07% | -0.07% |
| 65536 | -0.02% | -0.02% |
| 131072 | -0.12% | -0.12% |
| 262144 | -0.13% | -0.13% |
| 524288 | 1.97% | 1.97% |
| 1048576 | -0.05% | -0.03% |

However, users on Discord reported that MSVC's fill_n is terrible and is slower than our assignment logic. For these users, the hot spot in fill_n is a branch that determines whether the fill value is all zero bits; if it is, fill_n switches to memset. I figure we should use memset directly, without that branch, when the user calls setZero(). The requirements for using memset are the same as above, plus the additional requirement that the scalar type is arithmetic (a built-in scalar, essentially), since only then is the all-zero-bits pattern guaranteed to represent the value zero. The gains are modest up to size 16, then almost nothing. My real intent was to provide MSVC users with a decent optimization for setZero, since their fill_n is no good.
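
A minimal sketch of the idea (a hypothetical helper, not the actual patch), assuming an arithmetic scalar type and a contiguous destination:

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// For built-in arithmetic types, the all-zero-bits pattern is the value 0,
// so the whole range can be zeroed with one unconditional memset instead of
// letting fill_n branch on the fill value at runtime.
template <typename Scalar>
void zero_contiguous(Scalar* data, std::ptrdiff_t n) {
  static_assert(std::is_arithmetic<Scalar>::value,
                "memset is only valid when zero bits represent the value zero");
  std::memset(data, 0, static_cast<std::size_t>(n) * sizeof(Scalar));
}
```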

Benefit of using memset in setZero (clang++-18 -DNDEBUG -O3)

| Size | Real Time | CPU Time |
|---|---|---|
| 1 | -11.10% | -11.10% |
| 2 | -12.59% | -12.57% |
| 4 | -14.37% | -14.37% |
| 8 | -10.96% | -10.96% |
| 16 | -10.95% | -10.93% |
| 32 | -0.01% | -0.01% |
| 64 | 0.01% | 0.04% |
| 128 | -0.89% | -0.89% |
| 256 | 0.12% | 0.14% |
| 512 | 0.25% | 0.25% |
| 1024 | -0.09% | -0.09% |
| 2048 | -1.02% | -1.02% |
| 4096 | -0.02% | 0.01% |
| 8192 | -0.16% | -0.16% |
| 16384 | -0.18% | -0.18% |
| 32768 | 0.03% | 0.06% |
| 65536 | -0.05% | -0.05% |
| 131072 | -0.01% | 0.01% |
| 262144 | 0.08% | 0.08% |
| 524288 | 5.03% | 5.05% |
| 1048576 | -0.41% | -0.41% |

Additional information

Benchmark used for the tables above:

```cpp
#include <benchmark/benchmark.h>
#include <Eigen/Core>

using namespace Eigen;

template <typename T>
static void set_constant(benchmark::State& state) {
  const int n = static_cast<int>(state.range(0));
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  // head(n) yields a Block expression, so this exercises the new block path
  // rather than the existing PlainObject path.
  auto v_map = v.head(n);
  v.setRandom();
  const T rando = internal::random<T>();
  for (auto _ : state) {
    benchmark::DoNotOptimize(v_map.setConstant(rando));
  }
}

template <typename T>
static void set_zero(benchmark::State& state) {
  const int n = static_cast<int>(state.range(0));
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  auto v_map = v.head(n);
  v.setRandom();
  for (auto _ : state) {
    benchmark::DoNotOptimize(v_map.setZero());
  }
}

BENCHMARK(set_constant<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK(set_zero<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK_MAIN();
```
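
For reference, the benchmark should build with something like `clang++-18 -O3 -DNDEBUG -I <path-to-eigen> bench.cpp -lbenchmark -lpthread` (paths illustrative), matching the flags used for the measurements above.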