Added partial linear access for LHS & Output - 30% faster for bfloat16 GEMM MMA (Power)
Added partial linear access for LHS & Output - 30% faster (one third fewer memory loads). Fixed bfloat16 MMA GEMM to respect the disable-MMA flag.
Edited by Chip Kerchner