optimize setConstant, setZero
Reference issue
What does this implement/fix?
Currently, we swap out our usual assignment logic in `setConstant` for `std::fill_n` if the type is a `PlainObject`, i.e. a `Matrix` or `Array`. We should do the same for eligible blocks, or any type whose data is a contiguous array. The eligible types are (a rough sketch of the idea follows the list):
- `Array`
- `Matrix`
- Blocks with `InnerPanel == true` (`leftCols`, `middleCols`, `rightCols` for column-major matrices; likewise for row-major)
- Blocks representing a vector that matches the source expression's storage order (`head`, `segment`, `tail`)
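Below is a minimal, standalone sketch of the dispatch idea, not the actual code path in this MR: `fill_constant` and the tag dispatch on contiguity are made-up names for illustration, while the real change hooks into Eigen's assignment machinery.

```cpp
#include <algorithm>
#include <type_traits>
#include <Eigen/Core>

// Hypothetical helper: when the destination exposes one contiguous run of
// coefficients, forward a constant fill to std::fill_n ...
template <typename Dst, typename Scalar>
void fill_constant(Dst& dst, const Scalar& value, std::true_type /*contiguous*/) {
  std::fill_n(dst.data(), dst.size(), value);
}

// ... otherwise fall back to plain element-wise assignment.
template <typename Dst, typename Scalar>
void fill_constant(Dst& dst, const Scalar& value, std::false_type /*contiguous*/) {
  for (Eigen::Index j = 0; j < dst.cols(); ++j)
    for (Eigen::Index i = 0; i < dst.rows(); ++i) dst(i, j) = value;
}

int main() {
  Eigen::MatrixXf m(4, 8);
  // leftCols() of a column-major matrix is an inner-panel block, so its
  // coefficients form one contiguous run and the fill_n path applies.
  auto left = m.leftCols(3);
  fill_constant(left, 1.0f, std::true_type{});
}
```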
The gains are substantial for sizes up to 8192 (with some fluke noise at 1024); beyond that the difference is marginal.
Benefit of enabling `fill_n` in `setConstant` for eligible blocks (clang++-18 -DNDEBUG -O3):
Size | Real Time | CPU Time |
---|---|---|
1 | -56.05% | -56.04% |
2 | -43.60% | -43.61% |
4 | -12.39% | -12.39% |
8 | -33.83% | -33.82% |
16 | -34.88% | -34.88% |
32 | -57.11% | -57.10% |
64 | -64.02% | -64.02% |
128 | -33.04% | -33.04% |
256 | -8.39% | -8.39% |
512 | -8.97% | -8.95% |
1024 | 8.22% | 8.22% |
2048 | -21.33% | -21.31% |
4096 | -6.15% | -6.14% |
8192 | -4.56% | -4.54% |
16384 | -0.06% | -0.06% |
32768 | -0.07% | -0.07% |
65536 | -0.02% | -0.02% |
131072 | -0.12% | -0.12% |
262144 | -0.13% | -0.13% |
524288 | 1.97% | 1.97% |
1048576 | -0.05% | -0.03% |
However, users on Discord reported that MSVC's `fill_n` is terrible and is slower than our assignment logic. The hot spot in `fill_n` for these users is a branch that determines whether the value is all zero bits; if so, `fill_n` switches to `memset`. I figure we should use `memset` directly, without branching, when the user calls `setZero()`.

The requirements to use `memset` are the same as above, with the additional requirement that the type is arithmetic (a built-in scalar, essentially). The gains are modest up to size 16, then almost nothing. My real intent was to provide MSVC users with a decent optimization for `setZero`, since `fill_n` is no good there.
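As a standalone illustration of that idea (not the patch itself; `zero_fill` is a made-up name, and the contiguity of the destination is simply assumed here): for built-in arithmetic scalars the value zero is all zero bits, so the whole run can be handed to `std::memset` with no runtime branch.

```cpp
#include <cstring>
#include <type_traits>
#include <Eigen/Core>

// Hypothetical helper, assuming dst.data() is one contiguous run
// (PlainObject, inner-panel block, or storage-order-matching vector block).
template <typename Dst>
void zero_fill(Dst& dst) {
  using Scalar = typename Dst::Scalar;
  if constexpr (std::is_arithmetic<Scalar>::value) {
    // Built-in scalars: zero is all zero bits, so memset is valid and
    // skips fill_n's "is the pattern all zero?" check entirely.
    std::memset(dst.data(), 0, sizeof(Scalar) * dst.size());
  } else {
    dst.setZero();  // generic path for non-arithmetic scalars
  }
}

int main() {
  Eigen::VectorXf v(1024);
  auto head = v.head(512);  // contiguous vector block
  zero_fill(head);
}
```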
Benefit of using `memset` in `setZero` (clang++-18 -DNDEBUG -O3):
Size | Real Time | CPU Time |
---|---|---|
1 | -11.10% | -11.10% |
2 | -12.59% | -12.57% |
4 | -14.37% | -14.37% |
8 | -10.96% | -10.96% |
16 | -10.95% | -10.93% |
32 | -0.01% | -0.01% |
64 | 0.01% | 0.04% |
128 | -0.89% | -0.89% |
256 | 0.12% | 0.14% |
512 | 0.25% | 0.25% |
1024 | -0.09% | -0.09% |
2048 | -1.02% | -1.02% |
4096 | -0.02% | 0.01% |
8192 | -0.16% | -0.16% |
16384 | -0.18% | -0.18% |
32768 | 0.03% | 0.06% |
65536 | -0.05% | -0.05% |
131072 | -0.01% | 0.01% |
262144 | 0.08% | 0.08% |
524288 | 5.03% | 5.05% |
1048576 | -0.41% | -0.41% |
Additional information
test:
```cpp
#include <benchmark/benchmark.h>
#include <Eigen/Core>

using namespace Eigen;

template <typename T>
static void set_constant(benchmark::State& state) {
  int n = state.range(0);
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  // head(n) yields a contiguous vector block, exercising the block path.
  auto v_map = v.head(n);
  v.setRandom();
  const T rando = internal::random<T>();
  for (auto s : state) {
    benchmark::DoNotOptimize(v_map.setConstant(rando));
  }
}

template <typename T>
static void set_zero(benchmark::State& state) {
  int n = state.range(0);
  Eigen::Matrix<T, Eigen::Dynamic, 1> v(n);
  auto v_map = v.head(n);
  v.setRandom();
  for (auto s : state) {
    benchmark::DoNotOptimize(v_map.setZero());
  }
}

BENCHMARK(set_constant<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK(set_zero<float>)->RangeMultiplier(2)->Range(1, 1 << 20)->Threads(1);
BENCHMARK_MAIN();
```
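For reference, building the snippet typically looks something like `clang++-18 -O3 -DNDEBUG bench.cpp -lbenchmark -lpthread` (the file name and library paths are placeholders; adjust for your setup).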