Add MMA to BF16 GEMV - 5.0-6.3X faster (for Power)

Instead of converting BF16->F32, performing the operation (madd), and converting back F32->BF16 for each instruction, use MMA.
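To illustrate the difference between the two strategies, here is a portable scalar sketch. It is not the code from this merge request and uses no MMA intrinsics; the helper names (`bf16_to_f32`, `f32_to_bf16`, `gemv_roundtrip`, `gemv_f32_accum`) and the raw `uint16_t` storage are assumptions made for the example. The first GEMV converts back to BF16 after every madd; the second keeps an F32 accumulator and converts once per output element, which is roughly the pattern an F32-accumulating instruction such as the Power MMA BF16 outer product makes possible.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical storage type: the upper 16 bits of an IEEE-754 binary32 value.
using bf16 = std::uint16_t;

// Widen BF16 to F32 by placing the 16 bits in the high half of a float.
inline float bf16_to_f32(bf16 h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Narrow F32 to BF16 with round-to-nearest-even.
// NaN handling is deliberately omitted in this sketch.
inline bf16 f32_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  bits += 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<bf16>(bits >> 16);
}

// Row-major y = A * x with a BF16->F32->BF16 round trip on every madd.
std::vector<bf16> gemv_roundtrip(const std::vector<bf16>& A,
                                 const std::vector<bf16>& x,
                                 std::size_t rows, std::size_t cols) {
  std::vector<bf16> y(rows, 0);  // bit pattern 0 is +0.0 in BF16
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      float acc = bf16_to_f32(y[i]);
      acc += bf16_to_f32(A[i * cols + j]) * bf16_to_f32(x[j]);
      y[i] = f32_to_bf16(acc);  // rounded back to BF16 after each step
    }
  }
  return y;
}

// Row-major y = A * x with an F32 accumulator and a single final conversion.
std::vector<bf16> gemv_f32_accum(const std::vector<bf16>& A,
                                 const std::vector<bf16>& x,
                                 std::size_t rows, std::size_t cols) {
  std::vector<bf16> y(rows, 0);
  for (std::size_t i = 0; i < rows; ++i) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < cols; ++j) {
      acc += bf16_to_f32(A[i * cols + j]) * bf16_to_f32(x[j]);
    }
    y[i] = f32_to_bf16(acc);  // one conversion per output element
  }
  return y;
}
```

Besides doing fewer conversions, the F32-accumulating variant also rounds only once per output element, so the two functions can return different results for long rows.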

RowMajor is 6.3X faster. ColMajor is 5.0X faster. Both are 1.3-1.9X faster than F32 GEMV.