New panel modes for GEMM MMA (real & complex).
Better register usage and pipelining.
Up to 2.84X faster for small matrices.
For large matrices: 34% faster for F32 MMA real-only and 75% faster for F64 MMA real-only; 48% faster for F32 MMA complex and 32% faster for F64 MMA complex.
Up to 20% faster packing.
Also includes fixes for various compilers.