Inefficient setZero

Describe the feature you would like to be implemented.

Whenever possible, setZero should compile to a call to memset.

Unmodified from https://stackoverflow.com/q/68008696/1918193

    #include <iostream>
    #include <Eigen/Dense>
    #include <chrono>

    int main()
    {
        constexpr size_t t = 100000;
        constexpr size_t size = 100;

        auto t1 = std::chrono::steady_clock::now();
        Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> m1;
        for (size_t i = 0; i < t; ++i)
        {
            m1.setZero(size, size);
        }
        auto t2 = std::chrono::steady_clock::now();
        Eigen::Matrix<double, size, size> m2;
        for (size_t i = 0; i < t; ++i)
        {
            m2.setZero();
        }
        auto t3 = std::chrono::steady_clock::now();

        double t12 = static_cast<std::chrono::duration<double>>(t2 - t1).count();
        double t23 = static_cast<std::chrono::duration<double>>(t3 - t2).count();

        std::cout << "SetZero runtime" << std::endl;
        std::cout << "dynamic: " << t12 << " static: " << t23 << std::endl;
        return 0;
    }

This benchmark is not great: clang, for instance, manages to eliminate the second loop completely. Current gcc does not, however, so it will do; I'll use g++ -O3 on x86_64. We notice that setZero is much more efficient for the first matrix, where it compiles to a memset, than for the second, where it produces many repetitions of the following:

    movq    8(%rax), %rsi
    movq    (%rax), %rcx
    movsd   (%rsi), %xmm0
    movq    (%rcx), %rcx
    unpcklpd        %xmm0, %xmm0
    movaps  %xmm0, 48(%rcx,%rdx)

First, the unrolling makes the function large, which is probably why gcc does not inline it. Forcing the inlining with __attribute__((flatten)) helps, even though it still does not produce a memset.
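
A minimal sketch of that workaround, applied at the call site (flatten is GCC-specific and asks the compiler to inline every call made inside the annotated function; the wrapper itself is hypothetical):

    #include <Eigen/Dense>

    // Hypothetical caller-side workaround: GCC inlines all calls made
    // inside this function, pulling Eigen's assignment machinery in.
    __attribute__((flatten))
    void zero_fixed(Eigen::Matrix<double, 100, 100>& m)
    {
        m.setZero();
    }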

Second, looking at dense_assignment_loop::run after gcc's optimizations, we get a repetition of:

  _17 = kernel_4(D)->m_src;
  _18 = MEM[(const double &)_17];
  _19 = {_18, _18};
  _20 = kernel_4(D)->m_dst;
  _21 = MEM[(struct evaluator *)_20].m_data;
  _25 = ivtmp.302_713 + 16;
  _26 = _21 + _25;
  MEM[(__m128d * {ref-all})_26] = _19;
  _27 = kernel_4(D)->m_src;
  _28 = MEM[(const double &)_27];
  _29 = {_28, _28};
  _30 = kernel_4(D)->m_dst;
  _31 = MEM[(struct evaluator *)_30].m_data;
  _35 = ivtmp.302_713 + 32;
  _36 = _31 + _35;
  MEM[(__m128d * {ref-all})_36] = _29;

Even if you are not used to this syntax, you may notice that we keep rereading kernel.m_src and the value (0), and rebuilding a vector from it, instead of building a vector {0, 0} once and writing it to multiple places. That's because the memory write (of the __m128d {0, 0}) looks to gcc like it might overwrite pretty much anything, including the kernel itself.
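
For contrast, here is a minimal hand-written sketch of what we would like the inner loop to look like, with the packet built once outside the loop (the function and its tail handling are illustrative, not Eigen code):

    #include <emmintrin.h>
    #include <cstddef>

    void fill_zero(double* dst, std::size_t n)
    {
        const __m128d zero = _mm_setzero_pd(); // packet built once, hoisted
        std::size_t i = 0;
        for (; i + 2 <= n; i += 2)
            _mm_storeu_pd(dst + i, zero);      // no per-iteration reload
        for (; i < n; ++i)                     // scalar tail
            dst[i] = 0.0;
    }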

Would such a feature be useful for other users? Why?

Shorter, faster generated code is good for everyone.

Any hints on how to implement the requested feature?

  • We could have a special path for assigning zero that explicitly calls memset when possible and falls back to the current code otherwise (see the first sketch after this list). Or at least a special path for assigning a scalar, so that we can take a copy of it in a local variable, make it obvious to the compiler that the value is constant, and possibly build the single packet outside the loop.
  • Alternatively, we could be less clever and rely more on the compiler, giving it simple, non-unrolled loops that it can optimize itself. The shorter user code would also make inlining more likely. I don't think that's really the direction Eigen is going, though.
  • We could use __m128d and similar types less, replacing them with stricter, less portable versions. Replacing _mm_store_pd with *(__v2df*)to = from helps the compiler see that less aliasing is possible (second sketch below).
  • Some judicious use of the non-portable __restrict qualifier can also help (third sketch below), but it may require some care to make sure using it there is valid, and the heavy abstraction may make it hard to find a suitable place to add the qualifier.
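
A C++17 sketch of the first idea; the helper and the trait test are hypothetical, not Eigen's actual internals:

    #include <cstring>
    #include <cstddef>
    #include <type_traits>

    // memset is valid when the all-zero bit pattern represents Scalar(0),
    // which holds for integers and IEEE-754 floating-point types.
    template <typename Scalar>
    void assign_zero(Scalar* dst, std::size_t n)
    {
        if constexpr (std::is_arithmetic_v<Scalar>)
            std::memset(dst, 0, n * sizeof(Scalar));
        else
            for (std::size_t i = 0; i < n; ++i) // fallback: plain loop
                dst[i] = Scalar(0);
    }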
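
A sketch of the stricter store from the third bullet. This relies on a GCC detail: __v2df, unlike __m128d, does not carry the may_alias attribute, so the compiler may assume the store does not clobber unrelated objects such as the kernel:

    #include <emmintrin.h>

    // Hypothetical strict-aliasing variant of Eigen's pstore; requires
    // a 16-byte-aligned destination, just like _mm_store_pd.
    static inline void pstore_strict(double* to, __m128d from)
    {
        *(__v2df*)to = (__v2df)from;
    }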
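
And a sketch of the last bullet: with __restrict on the destination, the compiler knows the stores cannot feed back into the source, so the value can be read once and kept in a register (a hypothetical helper, not a proposed Eigen signature):

    #include <cstddef>

    // __restrict promises that dst does not alias src, so *src may be
    // loaded once and the broadcast hoisted out of the loop.
    static void broadcast_fill(double* __restrict dst, const double* src,
                               std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = *src;
    }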