`maxCoeff(Index*)` is very slow because Visitor.h is not vectorized

I discovered this while implementing rook pivoting in FullPivLU. It turns out that the pivot search, which consists of the expression

biggest_in_corner = m_lu.bottomRightCorner(rows-k, cols-k)
                        .cwiseAbs()
                        .maxCoeff(&row_of_biggest_in_corner, &col_of_biggest_in_corner);

takes over 4 times as long as the matrix updates themselves. With debugging help from @sarah.elkazdadi, it turned out that this computation is entirely scalar, i.e. none of it is vectorized.
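
For reference, here is a minimal plain C++ sketch of what this pivot search computes: a scan of the whole trailing block that tracks the largest absolute value and its position. It assumes column-major storage in a `std::vector<double>`. `PivotLocation` and `full_pivot_search` are hypothetical names for illustration, not Eigen API, and this is not Eigen's actual visitor code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; not part of Eigen.
struct PivotLocation {
  double value;     // largest absolute value found in the trailing block
  std::size_t row;  // row index relative to the trailing block
  std::size_t col;  // column index relative to the trailing block
};

// Scans the trailing (rows-k) x (cols-k) block of a column-major matrix and
// returns the largest |a(i,j)| together with its block-relative position.
PivotLocation full_pivot_search(const std::vector<double>& a,
                                std::size_t rows, std::size_t cols,
                                std::size_t k) {
  assert(a.size() == rows * cols && k < rows && k < cols);
  PivotLocation best{-1.0, 0, 0};  // absolute values are >= 0, so -1 is a safe start
  for (std::size_t j = k; j < cols; ++j) {
    for (std::size_t i = k; i < rows; ++i) {
      const double v = std::abs(a[i + j * rows]);
      if (v > best.value) {
        best = {v, i - k, j - k};
      }
    }
  }
  return best;
}
```

Every element of the trailing block is visited at every elimination step, so the cost of this search is of the same order as the trailing matrix update itself.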

As a proxy for the possible improvement from fixing this, consider the effect of using rook pivoting, which only scans a few rows and columns instead of the entire trailing matrix. This gives a dramatic speedup with AVX2:

name                   old cpu/op  new cpu/op  delta
BM_EigenFullPivLU/16   3.67µs ± 1%  2.65µs ± 1%  -27.77%  (p=0.000 n=49+54)
BM_EigenFullPivLU/64   96.8µs ± 3%  33.9µs ± 2%  -64.99%  (p=0.000 n=60+49)
BM_EigenFullPivLU/128   698µs ± 3%   164µs ± 2%  -76.51%  (p=0.000 n=57+59)
BM_EigenFullPivLU/512  39.6ms ± 2%   5.5ms ± 4%  -86.05%  (p=0.000 n=40+60)
BM_EigenFullPivLU/1k    327ms ± 3%    66ms ± 5%  -79.95%  (p=0.000 n=10+41)
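
To illustrate why rook pivoting touches so much less data, here is a rough plain C++ sketch of a rook pivot search: it alternates between scanning one column and one row of the trailing block until it reaches an entry that is the largest in absolute value in both. It assumes column-major storage in a `std::vector<double>` and finite entries. `RookPivot` and `rook_pivot_search` are hypothetical names, and this is an illustration of the general technique, not the code that produced the numbers above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; not part of Eigen.
struct RookPivot {
  double value;     // absolute value of the chosen pivot
  std::size_t row;  // row index relative to the trailing block
  std::size_t col;  // column index relative to the trailing block
};

// Rook pivot search on the trailing (rows-k) x (cols-k) block of a
// column-major matrix. Starts in the first column of the block and
// alternates column and row scans; stops when a scan finds nothing
// strictly larger, i.e. the current entry is maximal in its row and column.
RookPivot rook_pivot_search(const std::vector<double>& a,
                            std::size_t rows, std::size_t cols,
                            std::size_t k) {
  assert(a.size() == rows * cols && k < rows && k < cols);
  std::size_t row = k, col = k;
  double best = -1.0;  // absolute values are >= 0, so the first scan always improves
  bool scan_column = true;
  for (;;) {
    bool improved = false;
    if (scan_column) {
      for (std::size_t i = k; i < rows; ++i) {
        const double v = std::abs(a[i + col * rows]);
        if (v > best) { best = v; row = i; improved = true; }
      }
    } else {
      for (std::size_t j = k; j < cols; ++j) {
        const double v = std::abs(a[row + j * rows]);
        if (v > best) { best = v; col = j; improved = true; }
      }
    }
    if (!improved) break;  // current entry dominates its row and its column
    scan_column = !scan_column;
  }
  return {best, row - k, col - k};
}
```

The search terminates because the candidate value strictly increases with every successful scan. The pivot it returns is not necessarily the global maximum of the trailing block, which is the trade-off against the full search.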
Edited by Rasmus Munk Larsen