Add MMA to BF16 GEMV - 5.0-6.3X faster (for Power)
Instead of converting BF16->F32, doing the operation (madd), and converting back F32->BF16 for each instruction, use MMA.
RowMajor is 6.3X faster and ColMajor is 5.0X faster. Both are 1.3-1.9X faster than F32 GEMV.
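
For illustration only, a scalar C++ sketch of the two strategies described above. It is not the code in this change and uses no Power intrinsics. The helper names (bf16_to_f32, f32_to_bf16, gemv_roundtrip, gemv_f32_accumulate) are hypothetical, BF16 is assumed to be stored as the upper 16 bits of an IEEE-754 float, and the matrix is assumed to be row-major. The second function only models the idea of keeping the accumulator in F32 and converting once per output, which is roughly what an MMA-style accumulator allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: widen a BF16 value (stored as uint16_t) to F32.
inline float bf16_to_f32(std::uint16_t h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Hypothetical helper: narrow F32 to BF16 with round-to-nearest-even.
// NaN handling is omitted to keep the sketch short.
inline std::uint16_t f32_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  bits += 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<std::uint16_t>(bits >> 16);
}

// Old pattern: every madd converts BF16->F32, computes, converts F32->BF16.
std::vector<std::uint16_t> gemv_roundtrip(const std::vector<std::uint16_t>& a,
                                          const std::vector<std::uint16_t>& x,
                                          std::size_t rows, std::size_t cols) {
  assert(a.size() == rows * cols && x.size() == cols);
  std::vector<std::uint16_t> y(rows, 0);  // 0x0000 is +0.0 in BF16
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      float r = bf16_to_f32(y[i]) +
                bf16_to_f32(a[i * cols + j]) * bf16_to_f32(x[j]);
      y[i] = f32_to_bf16(r);  // narrowed after each madd
    }
  }
  return y;
}

// Modeled pattern: accumulate in F32, convert to BF16 once per output element.
std::vector<std::uint16_t> gemv_f32_accumulate(const std::vector<std::uint16_t>& a,
                                               const std::vector<std::uint16_t>& x,
                                               std::size_t rows, std::size_t cols) {
  assert(a.size() == rows * cols && x.size() == cols);
  std::vector<std::uint16_t> y(rows, 0);
  for (std::size_t i = 0; i < rows; ++i) {
    float acc = 0.0f;  // stays in F32 for the whole row
    for (std::size_t j = 0; j < cols; ++j) {
      acc += bf16_to_f32(a[i * cols + j]) * bf16_to_f32(x[j]);
    }
    y[i] = f32_to_bf16(acc);  // single narrowing conversion
  }
  return y;
}
```

In this sketch the first version performs one F32->BF16 conversion per multiply-add, while the second performs one per output element.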