Fix slowdown in bfloat16 MMA when rows is not a multiple of 8 or columns is not a multiple of 4.
Fixes a significant slowdown in bfloat16 MMA when the number of rows is not a multiple of 8 or the number of columns is not a multiple of 4: about 50% slower for columns (RHS) and 5-6% slower for rows (LHS). The fix required rewriting the packing and processing code, but only for the extra areas.
Packing in the extra areas was RowMajor. It is now ColMajor, so that a single pload_partial can be used instead of building the packet element by element.
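
A minimal, standalone sketch of why the layout change helps, not the actual Eigen code. It assumes an 8-lane packet of raw 16-bit values; `Packet8`, `load_partial`, `gather_row_major` and `load_col_major` are hypothetical names used only for illustration, with `load_partial` standing in for pload_partial. With a RowMajor extra block the lanes of one packet are strided in memory and have to be gathered one at a time; with ColMajor they are contiguous and one partial load is enough.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a SIMD packet holding 8 bfloat16 lanes as raw 16-bit storage.
using Packet8 = std::array<std::uint16_t, 8>;

// Hypothetical analogue of a partial load: copy n contiguous elements
// and leave the remaining lanes zero.
inline Packet8 load_partial(const std::uint16_t* src, std::size_t n) {
  assert(n <= 8);
  Packet8 p{};
  std::memcpy(p.data(), src, n * sizeof(std::uint16_t));
  return p;
}

// RowMajor extra block (extra_rows x depth): the values for depth index k
// are `depth` elements apart, so the packet is built lane by lane.
inline Packet8 gather_row_major(const std::uint16_t* block, std::size_t extra_rows,
                                std::size_t depth, std::size_t k) {
  assert(extra_rows <= 8);
  Packet8 p{};
  for (std::size_t r = 0; r < extra_rows; ++r) {
    p[r] = block[r * depth + k];
  }
  return p;
}

// ColMajor extra block: the values for depth index k are contiguous,
// so a single partial load produces the same packet.
inline Packet8 load_col_major(const std::uint16_t* block, std::size_t extra_rows,
                              std::size_t k) {
  return load_partial(block + k * extra_rows, extra_rows);
}
```

Both functions yield the same packet for the same logical data; only the memory access pattern differs, which is the point of the repacking described above.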
Edited by Chip Kerchner