Add vector_pair loads for the RHS of GEMM MMA. Improve predux in GEMV.

Load the RHS of GEMM MMA using vector_pairs (10% faster in some situations). Improve predux in GEMV so that it uses vectors instead of scalars. General cleanup of GEMV, e.g. removing the unnecessary typename Index from GEMM.
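To illustrate the general idea behind a vector-based predux, here is a minimal portable sketch, not the code from this merge request. It assumes 4-lane float vectors, modeled by a hypothetical `Vec4` type (a `std::array`) standing in for SIMD registers; `predux_scalar` and `predux4` are illustrative names, not Eigen's API. The scalar style reduces each accumulator separately, while the vector style reduces four accumulators at once so that lane `i` of the result holds the sum of accumulator `i`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane float vector standing in for a SIMD register.
using Vec4 = std::array<float, 4>;

// Lane-wise (vertical) addition of two vectors.
inline Vec4 add(const Vec4& a, const Vec4& b) {
  Vec4 r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + b[i];
  return r;
}

// Scalar style: one horizontal sum per accumulator, yielding a single float.
inline float predux_scalar(const Vec4& v) {
  return v[0] + v[1] + v[2] + v[3];
}

// Vector style: reduce four accumulators together.
// A 4x4 transpose followed by three vertical adds leaves
// sum(acc[i]) in lane i of the result.
inline Vec4 predux4(const std::array<Vec4, 4>& acc) {
  std::array<Vec4, 4> t{};
  for (std::size_t i = 0; i < 4; ++i)
    for (std::size_t j = 0; j < 4; ++j)
      t[j][i] = acc[i][j];  // transpose: lane j of acc[i] moves to t[j], lane i
  return add(add(t[0], t[1]), add(t[2], t[3]));
}
```

With this approach, a GEMV kernel holding four accumulators could, in principle, update four output elements with a single vector add and store instead of four horizontal reductions and four scalar updates. On real hardware the transpose would map to merge/permute instructions rather than a loop.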
