Proposal about optimized AssignEvaluator

I am currently looking at the code in AssignEvaluator.h and have ideas for improvement, but would like to hear your opinions before spending more time on it. Basically, the current assignment logic does not generate optimal code in several cases, especially when the exact size of the destination is not known at compile time. However, the logic could take the maximum size of the destination into account (if known at compile time, e.g. here: https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/Core/AssignEvaluator.h#L54 ), or could take the size of the source into account (currently not done at all in copy_using_evaluator_traits, not sure why?). Possible effects of the current code are:

vector instructions (e.g. SSE, AVX) are generated but cannot be used at runtime because the number of elements to copy is too small. Either no vector instructions can be used (e.g. only 3 floats to copy) or vector instructions could be used, but not the ones chosen by the code (e.g. 4 floats to copy but code chooses instructions that copy in packets of 8). So some unnecessary for-loops have to be skipped at runtime, and sometimes elements could be copied in packets but aren't
as a result of the previous point, in some cases compiler warnings like these are generated: #2506 (closed) they are indeed false-positives in the sense that at runtime, no out-of-bounds error happens because the generated vector instruction is skipped. But also, GCC is not completely wrong about showing the warning because generating the vector instruction could be avoided completely, all the necessary information is there at compile time
sometimes, loops could be unrolled but aren't. Though I am still in the process of understanding the unrolling logic

Example:

// suggested compile command: g++ eigen_assignment.cpp -o eigen_assignment -msse -mavx -O3 -Wall -Wextra -I/path/to/eigen
#define EIGEN_DEBUG_ASSIGN 1
#include <Eigen/Dense>
#include <iostream>
int main() {
#if 0 // switch between examples
    Eigen::Array<float, 3, 1> a;
    a.setRandom();
    
    Eigen::Array<float, -1, 1> b;
    b.setRandom(3, 1);
    b = a; // Chooses LinearVectorizedTraversal with packet size 8, but only 3 floats have to be copied (see a, which is source).
           // Should actually choose LinearTraversal.
#else
    Eigen::Array<double, 2, 1> a;
    a.setRandom();

    Eigen::Array<double, 3, 1> b;
    b.setRandom();
    b.tail(2) = a; // Chooses LinearVectorizedTraversal with packet size 4, but only 2 doubles have to be copied.
                   // Max size of destination b is 3. GCC emits array-bounds warnings.
                   // Should actually choose LinearVectorizedTraversal with packet size 2.
#endif
    return 0;
}

To be clear, no runtime errors like out-of-bound accesses are happening (that I am aware of), but currently, the generated code is inefficient and generates compiler warnings. I am willing to improve this in a merge request, but would first like to hear your opinions. Is there any reason why the logic in copy_using_evaluator_traits currently does not use the maximum size of the destination and the size/maximum size of the source (if known at compile time)?