Add a fast stack-like allocator for tempories

Submitted by Gael Guennebaud @ggael

Assigned to Nobody

Link to original bugzilla bug (#1568)
Version: 3.4 (development)

Description

This is a feature I've in mind for a while and that I've mentioned at few occasions in the past. Basically, the idea would be to implement our own stack for allocating temporaries with dynamic sizes, and thus save a huge amount of running time for small matrices.

For temporaries that are local to a function, then alloca + Map<> should be enough, as I just implemented there:

de9e31a0

In theory, it only remains to replace

nested_eval<T,N>::type V(X)

by

ei_declare_local_nested_eval(T,X,N,V)

:)

But some temporaries require a longer lifetime, typically the one created in the ctor of the evaluators. One possibility would be to keep using alloca by delegating the allocation in the function that creates/destroys the evaluator (e.g. in the assignment) and then propagate the allocated buffer down to the tree:

evaluator<Src> src_eval(src);
void* buffer = alloca(src_eval.nbTemporyBytes());
src_eval.init(buffer);
for(i=...)
dst_eval.coeffRef(i) = src_eval.coeff(i);

nbTemporyBytes() would sum up the requires bytes, and init(void* &buffer) would pass the remaining buffer to the leaves, e.g.:

binary_evaluator::init(void* &buffer) {
lhs_eval.init(buffer);
rhs_eval.init(buffer);
}

and an allocator that requested a temporary buffer would do:

product_evaluator::init(void* &buffer) {
::new(&m_tmp) (buffer, rows, cols);
generic_product_impl(m_tmp, lhs, rhs);
buffer += sizeof(Scalar)rowscols;
}

Of course if the total amount of requested memory is too large, then we would allocate on the heap, but only once, which is still a win if there are multiple temporaries.

Another option would be to implement our own stack-based allocator that would have the advantages to be cross-platform and that would possibly be much larger than the stack. It could even be dynamically expanded to amortize heap allocations across expressions. Another advantage is that the above three pass approach (ctor/nbTemporyBytes/init) could be avoided by allocating directly in the ctors.

The main difficulty of this approach is multi-threading for which I have no clue how to deal with.

Blocking

#1364

Edited by Eigen Bugzilla