`maxCoeff(Index*)` is very slow because Visitor.h is not vectorized
I discovered this while implementing rook pivoting in FullPivLU. It turns out that the pivot search, which consists of the expression
    biggest_in_corner = m_lu.bottomRightCorner(rows-k, cols-k)
                            .cwiseAbs()
                            .maxCoeff(&row_of_biggest_in_corner, &col_of_biggest_in_corner);
takes over 4 times as long as the matrix updates themselves. With debugging help from @sarah.elkazdadi, it turned out that this computation runs as entirely scalar code.
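
To make the cost concrete, here is a minimal sketch of what a scalar pivot search over the trailing corner amounts to. It is not Eigen's Visitor.h code: it uses plain column-major storage in a `std::vector`, and `PivotLocation` and `full_pivot_search` are hypothetical names introduced only for this illustration. Like the expression above, it reports indices relative to the corner.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch; not part of Eigen.
struct PivotLocation {
  double abs_value;  // largest |a(i,j)| found in the corner
  std::size_t row;   // row index relative to the corner
  std::size_t col;   // column index relative to the corner
};

// Scalar scan of the trailing (rows-k) x (cols-k) corner of a
// column-major matrix stored in `lu`. Assumes k < rows and k < cols.
// Every element is visited once, one comparison at a time.
PivotLocation full_pivot_search(const std::vector<double>& lu,
                                std::size_t rows, std::size_t cols,
                                std::size_t k) {
  PivotLocation best{-1.0, 0, 0};
  for (std::size_t j = k; j < cols; ++j) {
    for (std::size_t i = k; i < rows; ++i) {
      const double a = std::abs(lu[i + j * rows]);
      if (a > best.abs_value) {
        best = {a, i - k, j - k};
      }
    }
  }
  return best;
}
```

Each elimination step visits all (rows-k)*(cols-k) trailing elements, which is the same number of elements the rank-1 update touches. If the update is vectorized and the search is not, it is plausible for the search to dominate the run time.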
As a proxy for the possible improvement from fixing this, consider the effect of using rook pivoting, which only scans a few rows and columns instead of the entire trailing matrix. This gives a dramatic speedup with AVX2:
    name                    old cpu/op    new cpu/op    delta
    BM_EigenFullPivLU/16    3.67µs ± 1%   2.65µs ± 1%   -27.77%  (p=0.000 n=49+54)
    BM_EigenFullPivLU/64    96.8µs ± 3%   33.9µs ± 2%   -64.99%  (p=0.000 n=60+49)
    BM_EigenFullPivLU/128    698µs ± 3%    164µs ± 2%   -76.51%  (p=0.000 n=57+59)
    BM_EigenFullPivLU/512   39.6ms ± 2%    5.5ms ± 4%   -86.05%  (p=0.000 n=40+60)
    BM_EigenFullPivLU/1k     327ms ± 3%     66ms ± 5%   -79.95%  (p=0.000 n=10+41)
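
For readers unfamiliar with rook pivoting, the search can be sketched roughly as follows. This is a textbook-style illustration under my own assumptions (column-major storage in a `std::vector`, search starting in the first column of the corner), not the code that produced the numbers above; `RookPivot` and `rook_pivot_search` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch; not part of Eigen.
struct RookPivot {
  double abs_value;  // |a(row,col)| of the chosen pivot
  std::size_t row;   // row index relative to the corner
  std::size_t col;   // column index relative to the corner
};

// Rook pivot search in the trailing corner of a column-major matrix.
// Assumes k < rows and k < cols. Alternates between scanning one
// column and one row until the current candidate is the largest
// entry (in absolute value) in both its row and its column.
RookPivot rook_pivot_search(const std::vector<double>& lu,
                            std::size_t rows, std::size_t cols,
                            std::size_t k) {
  auto abs_at = [&](std::size_t i, std::size_t j) {
    return std::abs(lu[i + j * rows]);
  };
  std::size_t r = k, c = k;
  double best = -1.0;
  bool scan_column = true;
  for (;;) {
    bool improved = false;
    if (scan_column) {
      // Look for a strictly larger entry in column c.
      for (std::size_t i = k; i < rows; ++i) {
        const double a = abs_at(i, c);
        if (a > best) { best = a; r = i; improved = true; }
      }
    } else {
      // Look for a strictly larger entry in row r.
      for (std::size_t j = k; j < cols; ++j) {
        const double a = abs_at(r, j);
        if (a > best) { best = a; c = j; improved = true; }
      }
    }
    if (!improved) break;  // candidate dominates its row and column
    scan_column = !scan_column;
  }
  return {best, r - k, c - k};
}
```

The loop terminates because the candidate strictly grows on every switch. The pivot it returns is the largest entry in its own row and column, but not necessarily the largest entry in the whole trailing matrix, which is why far fewer elements typically need to be scanned.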
Edited by Rasmus Munk Larsen