Speed up StableNorm for non-trivial sizes and improve consistency between aligned and unaligned inputs.
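
For context, here is a minimal sketch of why a stable norm differs from a naive one. It is not Eigen's implementation: `naive_norm` and `scaled_norm` are hypothetical helper names, and the single-pass scaling shown here is only the textbook idea of dividing by the largest magnitude before squaring.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper (not Eigen code): plain sqrt(sum of squares).
// The squares can overflow to infinity even when the true norm is representable.
double naive_norm(const std::vector<double>& x) {
  double sum = 0.0;
  for (double v : x) sum += v * v;
  return std::sqrt(sum);
}

// Hypothetical helper (not Eigen code): scale by the largest magnitude first,
// so every squared term is at most 1, then undo the scaling at the end.
double scaled_norm(const std::vector<double>& x) {
  double scale = 0.0;
  for (double v : x) scale = std::max(scale, std::abs(v));
  // An all-zero input has norm 0; an infinite entry gives an infinite norm.
  if (scale == 0.0 || !std::isfinite(scale)) return scale;
  double sum = 0.0;
  for (double v : x) {
    const double t = v / scale;
    sum += t * t;
  }
  return scale * std::sqrt(sum);
}
```

With the input `{3e200, 4e200}`, the naive version overflows because the squares exceed the range of `double`, while the scaled version returns a finite value close to `5e200`.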

Fixes #2847 (closed).

I measured the performance using the benchmark code provided by @cantonios in !1460 (merged).

Benchmark measurements show a significant speedup:

- SSE: https://gitlab.com/libeigen/eigen/-/snippets/3737877
- AVX2: $3737870

Edited by Rasmus Munk Larsen
