Added partial linear access for LHS & Output - 30% faster for bfloat16 GEMM MMA (Power)

Added partial linear access for the LHS and output, which makes bfloat16 MMA GEMM 30% faster (one-third fewer memory loads). Fixed bfloat16 MMA GEMM so that it honors the disable-MMA flag.

Edited by Chip Kerchner
