Inefficient setZero
Describe the feature you would like to be implemented.
Whenever possible, setZero should compile to a call to memset.
Unmodified from https://stackoverflow.com/q/68008696/1918193
#include <iostream>
#include <Eigen/Dense>
#include <chrono>

int main()
{
    constexpr size_t t = 100000;
    constexpr size_t size = 100;

    auto t1 = std::chrono::steady_clock::now();
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> m1;
    for (size_t i = 0; i < t; ++i)
    {
        m1.setZero(size, size);
    }

    auto t2 = std::chrono::steady_clock::now();
    Eigen::Matrix<double, size, size> m2;
    for (size_t i = 0; i < t; ++i)
    {
        m2.setZero();
    }

    auto t3 = std::chrono::steady_clock::now();
    double t12 = static_cast<std::chrono::duration<double>>(t2 - t1).count();
    double t23 = static_cast<std::chrono::duration<double>>(t3 - t2).count();
    std::cout << "SetZero runtime" << std::endl;
    std::cout << "dynamic: " << t12 << " static: " << t23 << std::endl;
    return 0;
}
This benchmark is not so good; for instance, clang manages to eliminate the second loop completely. However, current gcc does not, so it will do. I'll use g++ -O3 on x86_64.
We notice that setZero is much more efficient for the first (dynamic-size) matrix, where it compiles to a memset, than for the second (fixed-size) matrix, where it produces many repetitions of the following:
movq 8(%rax), %rsi
movq (%rax), %rcx
movsd (%rsi), %xmm0
movq (%rcx), %rcx
unpcklpd %xmm0, %xmm0
movaps %xmm0, 48(%rcx,%rdx)
First, the unrolling makes the function large, which is probably why gcc does not inline it. Forcing inlining with __attribute__((flatten)) helps, even though it still does not produce a memset.
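For reference, a minimal sketch of that experiment (zero_fixed is a made-up helper, and the attribute is gcc-specific):

#include <Eigen/Dense>

// flatten asks gcc to inline the entire call tree of setZero into this
// function, sidestepping its "function too large" inlining heuristic.
__attribute__((flatten))
void zero_fixed(Eigen::Matrix<double, 100, 100>& m)
{
    m.setZero();
}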
Second, looking at dense_assignment_loop::run after gcc optimizations, we get a repetition of
_17 = kernel_4(D)->m_src;
_18 = MEM[(const double &)_17];
_19 = {_18, _18};
_20 = kernel_4(D)->m_dst;
_21 = MEM[(struct evaluator *)_20].m_data;
_25 = ivtmp.302_713 + 16;
_26 = _21 + _25;
MEM[(__m128d * {ref-all})_26] = _19;
_27 = kernel_4(D)->m_src;
_28 = MEM[(const double &)_27];
_29 = {_28, _28};
_30 = kernel_4(D)->m_dst;
_31 = MEM[(struct evaluator *)_30].m_data;
_35 = ivtmp.302_713 + 32;
_36 = _31 + _35;
MEM[(__m128d * {ref-all})_36] = _29;
Even if you are not used to this syntax, you may notice that we keep rereading kernel.m_src and the value (0), and rebuilding a vector from it, instead of building a vector {0, 0} once and writing it to multiple places. That's because the memory write (of the __m128d {0, 0}) looks to gcc like it might overwrite pretty much anything, including kernel itself.
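For readers not used to GIMPLE, here is a standalone sketch of the same effect outside Eigen (fill_naive and fill_hoisted are made-up names; dst is assumed 16-byte aligned and n even):

#include <emmintrin.h>
#include <cstddef>

// Mirrors the generated code above: the scalar is reread and the packet
// rebuilt on every iteration, because the store through __m128d (declared
// may_alias by gcc) could, for all gcc knows, modify *src.
void fill_naive(double* dst, const double* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 2)
        _mm_store_pd(dst + i, _mm_set1_pd(*src));
}

// Copying the scalar into a local first makes the packet visibly
// loop-invariant, so it is built once, before the loop.
void fill_hoisted(double* dst, const double* src, std::size_t n)
{
    const __m128d packet = _mm_set1_pd(*src);
    for (std::size_t i = 0; i < n; i += 2)
        _mm_store_pd(dst + i, packet);
}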
Would such a feature be useful for other users? Why?
Shorter, faster generated code is good for everyone.
Any hints on how to implement the requested feature?
- We could have a special path for assigning zero that explicitly calls memset when possible and falls back to the current code otherwise. Or, at least, a special path when assigning a scalar, so we can copy it into a local variable, making it more obvious to the compiler that it is constant, and possibly build the single packet outside the loop (a sketch of such a path follows this list).
- Alternatively, we could be less clever and rely on the compiler more, giving it simple, non-unrolled loops that it can optimize itself. The shorter user code also makes inlining more likely. I don't think that's really the direction Eigen is going, though.
- We could try to use __m128d and similar types less, replacing them with stricter, less portable versions. Replacing _mm_store_pd with *(__v2df*)to = from helps the compiler know that less aliasing is possible (see the second sketch after this list).
- Some judicious uses of the non-portable __restrict can also help, but it may require some care to make sure using it there is valid, and the heavy abstraction may make it hard to find a suitable place to add this qualifier.
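To make the first idea concrete, here is a rough sketch of such a special path; assign_constant is a made-up free function, not Eigen's actual assignment machinery:

#include <cstring>
#include <cstddef>
#include <type_traits>

template <typename Scalar>
void assign_constant(Scalar* dst, std::size_t n, Scalar value)
{
    // Fast path: for trivially copyable scalars whose zero is the all-zero
    // bit pattern (integers, IEEE float/double), assigning zero is a memset.
    if (std::is_trivially_copyable<Scalar>::value && value == Scalar(0)) {
        std::memset(dst, 0, n * sizeof(Scalar));
        return;
    }
    // Fallback: copy the scalar into a local so the compiler can see it is
    // loop-invariant (and could build a single packet outside the loop).
    const Scalar v = value;
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = v;
}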
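For the last two points, a small illustration of how the store type and __restrict change what the compiler may assume (function names made up; v2df is a local stand-in for gcc's internal __v2df):

#include <emmintrin.h>
#include <cstddef>

// A plain gcc vector type, without the may_alias attribute.
typedef double v2df __attribute__((vector_size(16)));

// _mm_store_pd stores through __m128d, which gcc declares with
// __attribute__((may_alias)): the store may alias objects of any type.
void store_lax(double* to, __m128d from)
{
    _mm_store_pd(to, from);
}

// Storing through a plain vector type follows normal aliasing rules: it can
// only alias doubles, so unrelated loads may be hoisted across it.
// ('to' must be 16-byte aligned, as with _mm_store_pd.)
void store_strict(double* to, __m128d from)
{
    *(v2df*)to = (v2df)from;
}

// __restrict instead promises that dst and src do not overlap, which also
// allows hoisting the load of *src out of the loop.
void fill_restrict(double* __restrict dst, const double* __restrict src,
                   std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 2)
        _mm_store_pd(dst + i, _mm_set1_pd(*src));
}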