Speed up StableNorm for non-trivial sizes and improve consistency between aligned and unaligned inputs.
Fixes #2847 (closed).
I measured the performance using the benchmark code provided by @cantonios in !1460 (merged).
The benchmarks show a significant speedup for both SSE (https://gitlab.com/libeigen/eigen/-/snippets/3737877) and AVX2 ($3737870).
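For readers unfamiliar with why a stable norm needs its own code path, here is a minimal sketch of the underlying idea. It is not Eigen's implementation and does not reflect the changes in this MR. The function names `naive_norm` and `scaled_norm` are hypothetical, and the sketch uses plain `std::vector<double>` with no vectorization. It only illustrates that summing squares directly can overflow, while dividing by the largest magnitude first keeps the intermediate values in range.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical illustration only; not Eigen's StableNorm code.

// Naive 2-norm: squares each entry directly, so the running sum
// overflows to infinity once |x_i| is larger than about 1e154.
double naive_norm(const std::vector<double>& x) {
  double sum = 0.0;
  for (double v : x) sum += v * v;
  return std::sqrt(sum);
}

// Scaled 2-norm: divide every entry by the largest magnitude before
// squaring, then multiply the result back by that scale.
double scaled_norm(const std::vector<double>& x) {
  double scale = 0.0;
  for (double v : x) scale = std::max(scale, std::abs(v));
  // Empty or all-zero input gives 0; an infinite entry gives infinity.
  if (scale == 0.0 || !std::isfinite(scale)) return scale;
  double sum = 0.0;
  for (double v : x) {
    const double t = v / scale;  // |t| <= 1, so t * t cannot overflow
    sum += t * t;
  }
  return scale * std::sqrt(sum);
}
```

The scaled version costs an extra pass and a division per element, which is the kind of overhead that makes the performance of a stable norm worth benchmarking.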
Edited by Rasmus Munk Larsen