Fix slowdown in bfloat16 MMA when the row count is not a multiple of 8 or the column count is not a multiple of 4.

Fixed a significant slowdown in bfloat16 MMA when the number of rows is not a multiple of 8 or the number of columns is not a multiple of 4: about 50% slower for columns (RHS) and 5-6% slower for rows (LHS). The fix required rewriting the packing and processing code, but only for the extra areas (the leftover rows and columns).

Packing was RowMajor in the extra areas. It is now ColMajor, so that a simple pload_partial can be used instead of building the packet element by element.
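To illustrate why the storage order matters, here is a minimal standalone sketch, not the actual Eigen kernel code. It assumes a leftover block of `extra_rows` rows and `depth` columns, a packet width of 8, and `std::uint16_t` as a stand-in for bfloat16. The names `load_partial`, `gather_row_major`, and `load_col_major` are hypothetical; `load_partial` only imitates the idea of a partial packet load and is not Eigen's `pload_partial`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed packet width for this sketch (8 bfloat16-sized lanes).
constexpr std::size_t kPacketSize = 8;
using Packet = std::array<std::uint16_t, kPacketSize>;

// Hypothetical stand-in for a partial packet load:
// copies n contiguous values and leaves the remaining lanes zero.
Packet load_partial(const std::uint16_t* from, std::size_t n) {
  assert(n <= kPacketSize);
  Packet p{};  // value-initialized, so unused lanes are 0
  std::memcpy(p.data(), from, n * sizeof(std::uint16_t));
  return p;
}

// RowMajor block: element (r, k) lives at r * depth + k.
// The values for one depth index k are strided, so the packet
// has to be assembled one element at a time.
Packet gather_row_major(const std::uint16_t* block, std::size_t extra_rows,
                        std::size_t depth, std::size_t k) {
  assert(extra_rows <= kPacketSize);
  Packet p{};
  for (std::size_t r = 0; r < extra_rows; ++r) {
    p[r] = block[r * depth + k];
  }
  return p;
}

// ColMajor block: element (r, k) lives at k * extra_rows + r.
// The values for one depth index k are contiguous, so a single
// partial load is enough.
Packet load_col_major(const std::uint16_t* block, std::size_t extra_rows,
                      std::size_t k) {
  return load_partial(block + k * extra_rows, extra_rows);
}
```

Both functions produce the same packet for the same logical block; the ColMajor version replaces the per-element loop with one contiguous copy, which is the kind of saving the packing change is aiming for.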

Edited by Chip Kerchner
