Partial vectorization leads to slow(er) code (Regression from Eigen 3.2)

Summary

The attached example runs roughly 3x slower with Eigen 3.3 (and Eigen 3.4-rc1) than with the 3.2 branch. The example mainly accesses individual elements and performs computations with them, plus a single squaredNorm of a 3-vector segment.
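For readers without the attachment, the access pattern described above can be sketched roughly as follows. This is not the code from eigen_bench_augment.cpp: the names State, tail_sq_norm and kernel, the vector size of 6 and the arithmetic are all hypothetical and only illustrate "scalar element access plus one squaredNorm on a 3-element segment". The plain-array branch is there only so the snippet still compiles when Eigen is not on the include path.

```cpp
#include <cstddef>

#if __has_include(<Eigen/Core>)
#include <Eigen/Core>
// Hypothetical fixed-size state vector (size 6 is an assumption).
using State = Eigen::Matrix<double, 6, 1>;
// The single vector expression: squared norm of a 3-element segment.
inline double tail_sq_norm(const State& x) {
  return x.segment<3>(3).squaredNorm();
}
#else
#include <array>
// Fallback without Eigen, same arithmetic written out by hand.
using State = std::array<double, 6>;
inline double tail_sq_norm(const State& x) {
  return x[3] * x[3] + x[4] * x[4] + x[5] * x[5];
}
#endif

// Hypothetical kernel: mostly scalar element access, plus one
// squared norm over a 3-element segment of the same vector.
inline double kernel(const State& x) {
  const double a = x[0] * x[1] + x[2];  // individual element access
  return a + tail_sq_norm(x);           // 3-vector segment squaredNorm
}
```

In a pattern like this, the segment is the only place where partial vectorization could apply, which is why toggling it is the interesting experiment.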

Environment

Operating System : macOS 11.2.3 Big Sur (but similar behavior has been observed on Linux)
Architecture : x64
Eigen Version : 3.3.X, 3.4-rc1
Compiler Version : clang 11.1.0, gcc 10.2.0
Compile Flags : -std=c++17 -Ofast -march=native -fno-finite-math-only -fno-stack-protector
Vector Extension : AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI

Minimal Example

eigen_bench_augment.cpp

Steps to reproduce

1. Compile the included benchmark, e.g. clang++ -std=c++17 -Ofast -march=native -fno-finite-math-only -fno-stack-protector eigen_bench_augment.cpp -I EIGEN_INC
2. Run the binary, which will print some timing information.
3. Repeat the above with Eigen 3.2 and contrast with Eigen 3.3 (or 3.4-rc1).
4. Try 3.3 or 3.4 without partial vectorization by adding -DEIGEN_UNALIGNED_VECTORIZE=0 to the compile flags.
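When switching between Eigen versions and flags as in the steps above, it is easy to lose track of which configuration a binary was actually built with. A minimal sketch of a helper that reports it, assuming Eigen's version macros and Eigen::SimdInstructionSetsInUse() are available in the headers being used; the function name eigen_config_summary is hypothetical, and the code assumes EIGEN_UNALIGNED_VECTORIZE, when defined, expands to an integer:

```cpp
#include <string>

#if __has_include(<Eigen/Core>)
#include <Eigen/Core>
#endif

// Hypothetical helper: describes the Eigen configuration of this build.
inline std::string eigen_config_summary() {
#ifdef EIGEN_WORLD_VERSION
  std::string s = "Eigen " + std::to_string(EIGEN_WORLD_VERSION) + "." +
                  std::to_string(EIGEN_MAJOR_VERSION) + "." +
                  std::to_string(EIGEN_MINOR_VERSION);
  s += ", SIMD: ";
  s += Eigen::SimdInstructionSetsInUse();
#ifdef EIGEN_UNALIGNED_VECTORIZE
  // Set by -DEIGEN_UNALIGNED_VECTORIZE=0/1, or by the headers' default.
  s += ", EIGEN_UNALIGNED_VECTORIZE=" +
       std::to_string(EIGEN_UNALIGNED_VECTORIZE);
#else
  // Expected for headers that predate the macro, such as the 3.2 branch.
  s += ", EIGEN_UNALIGNED_VECTORIZE not defined";
#endif
  return s;
#else
  return "Eigen not found on the include path";
#endif
}
```

Printing this string once at the start of the benchmark makes each timing result self-describing.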

What is the current bug behavior?

Performance is slower than expected with Eigen 3.3 and 3.4-rc1. On my system, 3.3/3.4 with partial vectorization enabled (which is the default) is roughly 3x slower than either Eigen 3.2 or 3.3/3.4 with partial vectorization disabled.
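A ratio like the one above is sensitive to how the timing is taken. A minimal sketch of a timing helper, assuming std::chrono::steady_clock and a best-of-N scheme; this is not the harness used in the attached benchmark, and the name best_ns_per_call and its parameters are hypothetical:

```cpp
#include <chrono>
#include <cstddef>

// Returns the best (smallest) observed time per call of f, in
// nanoseconds, over `repeats` batches of `iters` calls each.
// Requires iters > 0 and repeats > 0.
template <class F>
double best_ns_per_call(F&& f, std::size_t iters, int repeats = 5) {
  using clock = std::chrono::steady_clock;
  volatile double sink = 0.0;  // keeps the results observable
  double best = -1.0;
  for (int r = 0; r < repeats; ++r) {
    const auto t0 = clock::now();
    double acc = 0.0;
    for (std::size_t i = 0; i < iters; ++i) acc += f(i);
    const auto t1 = clock::now();
    sink = acc;  // store the result so the loop is not removed outright
    const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count() /
        static_cast<double>(iters);
    if (best < 0.0 || ns < best) best = ns;
  }
  return best;
}
```

Taking the minimum over several batches reduces noise from frequency scaling and scheduling. With -Ofast the compiler may still restructure the accumulation loop, so the numbers are best read alongside the generated assembly.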

What is the expected correct behavior?

Partial vectorization should only be used in cases where it is profitable. Disabling it should not lead to a speed-up.

Anything else that might help

To me, the generated code doesn't look like it should be 3x slower, so I'm not sure what's going on. I see two more uses of FMA in the Eigen 3.2 assembly, but that alone shouldn't lead to such a large difference.

As mentioned before, Eigen 3.2 (which didn't support partial vectorization) didn't have this issue.