Improve performance for Power10 MMA bfloat16 GEMM
Changes include packing for rank-2-friendly data, better indexing variables, elimination of MMA masking, improved edge handling, hardware bfloat16 conversions, a fix for a slowdown with LLVM, use of LinearMappers, and general cleanup.
It is now up to 61X faster than the generic GEMM code. Compared with the previous version, it is 2.3X faster with GCC and 7-12X faster with LLVM.
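
For context on the "hardware bfloat16 conversions" item, here is a minimal sketch of the kind of software float/bfloat16 conversion that a hardware conversion instruction can replace. It is not code from this change: the helper names float_to_bf16 and bf16_to_float are hypothetical, and round-to-nearest-even is assumed as the rounding mode.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not taken from the patch.

// Convert an IEEE-754 binary32 value to bfloat16 bits (the upper 16 bits of
// the float), rounding to nearest with ties to even.
inline std::uint16_t float_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // well-defined type punning
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
    // NaN: keep the sign and set a quiet bit so truncation cannot yield infinity.
    return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
  }
  // Add 0x7FFF plus the lowest kept bit: exact ties round to the even value.
  const std::uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return static_cast<std::uint16_t>(bits >> 16);
}

// Widen bfloat16 bits back to float; this direction is exact.
inline float bf16_to_float(std::uint16_t h) {
  const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

In a scalar path like this, every element costs several integer operations, which is why moving the conversion to hardware can matter inside a GEMM packing or store loop.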
Edited by Chip Kerchner