Inefficient setZero

Describe the feature you would like to be implemented.

Whenever possible, setZero should compile to a call to memset.

Unmodified from https://stackoverflow.com/q/68008696/1918193

    #include <iostream>
    #include <Eigen/Dense>
    #include <chrono>

    int main()
    {
        constexpr size_t t = 100000;
        constexpr size_t size = 100;

        auto t1 = std::chrono::steady_clock::now();
        Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> m1;
        for (size_t i = 0; i < t; ++i)
        {
            m1.setZero(size, size);
        }
        auto t2 = std::chrono::steady_clock::now();
        Eigen::Matrix<double, size, size> m2;
        for (size_t i = 0; i < t; ++i)
        {
            m2.setZero();
        }
        auto t3 = std::chrono::steady_clock::now();

        double t12 = static_cast<std::chrono::duration<double>>(t2 - t1).count();
        double t23 = static_cast<std::chrono::duration<double>>(t3 - t2).count();

        std::cout << "SetZero runtime" << std::endl;
        std::cout << "dynamic: " << t12 << " static: " << t23 << std::endl;
        return 0;
    }

This benchmark is not great: clang, for instance, manages to eliminate the second loop completely. Current gcc does not, however, so it will do; I'll use g++ -O3 on x86_64. We notice that setZero is much more efficient for the first matrix, where it compiles to a memset, than for the second, where it produces many repetitions of the following:

    movq    8(%rax), %rsi
    movq    (%rax), %rcx
    movsd   (%rsi), %xmm0
    movq    (%rcx), %rcx
    unpcklpd        %xmm0, %xmm0
    movaps  %xmm0, 48(%rcx,%rdx)

First, the unrolling makes the function large, which is probably why gcc does not inline it. Forcing the inlining with __attribute__((flatten)) helps, even though it still does not produce a memset.
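
A minimal sketch of that workaround, applied at the call site (flatten is GCC-specific and asks the compiler to inline every call made inside the annotated function; the wrapper itself is hypothetical):

    #include <Eigen/Dense>

    // Hypothetical caller-side workaround: GCC inlines all calls made
    // inside this function, pulling Eigen's assignment machinery in.
    __attribute__((flatten))
    void zero_fixed(Eigen::Matrix<double, 100, 100>& m)
    {
        m.setZero();
    }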

Second, looking at dense_assignment_loop::run after gcc's optimizations, we get a repetition of:

  _17 = kernel_4(D)->m_src;
  _18 = MEM[(const double &)_17];
  _19 = {_18, _18};
  _20 = kernel_4(D)->m_dst;
  _21 = MEM[(struct evaluator *)_20].m_data;
  _25 = ivtmp.302_713 + 16;
  _26 = _21 + _25;
  MEM[(__m128d * {ref-all})_26] = _19;
  _27 = kernel_4(D)->m_src;
  _28 = MEM[(const double &)_27];
  _29 = {_28, _28};
  _30 = kernel_4(D)->m_dst;
  _31 = MEM[(struct evaluator *)_30].m_data;
  _35 = ivtmp.302_713 + 32;
  _36 = _31 + _35;
  MEM[(__m128d * {ref-all})_36] = _29;

Even if you are not used to this syntax, you may notice that we keep rereading kernel.m_src and the value (0), and rebuilding a vector from it, instead of building a vector {0, 0} once and writing it to multiple places. That's because the memory write (of the __m128d {0, 0}) looks to gcc like it might overwrite pretty much anything, including the kernel itself.
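
For contrast, here is a minimal hand-written sketch of what we would like the inner loop to look like, with the packet built once outside the loop (the function and its tail handling are illustrative, not Eigen code):

    #include <emmintrin.h>
    #include <cstddef>

    void fill_zero(double* dst, std::size_t n)
    {
        const __m128d zero = _mm_setzero_pd(); // packet built once, hoisted
        std::size_t i = 0;
        for (; i + 2 <= n; i += 2)
            _mm_storeu_pd(dst + i, zero);      // no per-iteration reload
        for (; i < n; ++i)                     // scalar tail
            dst[i] = 0.0;
    }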

Would such a feature be useful for other users? Why?

Shorter, faster generated code is good for everyone.

Any hints on how to implement the requested feature?

  • We could have a special path for assigning zero that explicitly calls memset when possible and falls back to the current code otherwise (see the first sketch after this list). Or at least a special path for assigning a scalar, so that we can take a copy of it in a local variable, make it obvious to the compiler that the value is constant, and possibly build the single packet outside the loop.
  • Alternatively, we could be less clever and rely more on the compiler, giving it simple, non-unrolled loops that it can optimize itself. The shorter user code would also make inlining more likely. I don't think that's really the direction Eigen is going, though.
  • We could use __m128d and similar types less, replacing them with stricter, less portable versions. Replacing _mm_store_pd with *(__v2df*)to = from helps the compiler see that less aliasing is possible (second sketch below).
  • Some judicious use of the non-portable __restrict qualifier can also help (third sketch below), but it may require some care to make sure using it there is valid, and the heavy abstraction may make it hard to find a suitable place to add the qualifier.
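
A C++17 sketch of the first idea; the helper and the trait test are hypothetical, not Eigen's actual internals:

    #include <cstring>
    #include <cstddef>
    #include <type_traits>

    // memset is valid when the all-zero bit pattern represents Scalar(0),
    // which holds for integers and IEEE-754 floating-point types.
    template <typename Scalar>
    void assign_zero(Scalar* dst, std::size_t n)
    {
        if constexpr (std::is_arithmetic_v<Scalar>)
            std::memset(dst, 0, n * sizeof(Scalar));
        else
            for (std::size_t i = 0; i < n; ++i) // fallback: plain loop
                dst[i] = Scalar(0);
    }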
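
A sketch of the stricter store from the third bullet. This relies on a GCC detail: __v2df, unlike __m128d, does not carry the may_alias attribute, so the compiler may assume the store does not clobber unrelated objects such as the kernel:

    #include <emmintrin.h>

    // Hypothetical strict-aliasing variant of Eigen's pstore; requires
    // a 16-byte-aligned destination, just like _mm_store_pd.
    static inline void pstore_strict(double* to, __m128d from)
    {
        *(__v2df*)to = (__v2df)from;
    }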
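
And a sketch of the last bullet: with __restrict on the destination, the compiler knows the stores cannot feed back into the source, so the value can be read once and kept in a register (a hypothetical helper, not a proposed Eigen signature):

    #include <cstddef>

    // __restrict promises that dst does not alias src, so *src may be
    // loaded once and the broadcast hoisted out of the loop.
    static void broadcast_fill(double* __restrict dst, const double* src,
                               std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = *src;
    }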