Add dynamic dispatch to BF16 GEMM (Power) and new VSX version
Adds dynamic dispatch to the BF16 GEMM on Power, plus a new VSX version that is 13.4X faster than the original generic code and 1.36X faster than the F32 GEMM (non-MMA).
Other fixes and improvements:
- Many conversions between BF16 and F32 (BF16 <-> F32).
- Improve dynamic dispatch code for all cases.
- Use hardware conversion on P10 for vector F32->BF16.
- Simplify partial packet loads and stores for early processors.
- Improve software conversion in vector F32->BF16 - up to 40% faster.
- Disable subnormal calculations, since none of the other architectures support them.
- Fix compilation issues and make code consistent.
- Generic code is 1.84X faster due to improved vector conversions.
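
For readers unfamiliar with the F32 <-> BF16 conversions mentioned above, here is a minimal scalar sketch of a software conversion using round-to-nearest-even. It only illustrates the general technique and is not the vector code in this change; the function names `f32_to_bf16_rne` and `bf16_to_f32` are hypothetical, and the NaN handling is one common choice rather than a statement of what this code does.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, for illustration only (not the API in this change).
// Converts an F32 value to BF16 bits using round-to-nearest-even.
inline std::uint16_t f32_to_bf16_rne(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // type-pun without undefined behavior

  // NaN: exponent all ones and a nonzero mantissa. Truncate and force a
  // mantissa bit on so the result stays a (quiet) NaN. This is an assumption.
  if ((bits & 0x7fffffffu) > 0x7f800000u) {
    return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
  }

  // Add 0x7fff plus the lowest kept bit: exact ties round to the even value,
  // everything else rounds to nearest.
  const std::uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7fffu + lsb;
  return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: BF16 -> F32 is exact, since BF16 is the upper
// 16 bits of an F32 value.
inline float bf16_to_f32(std::uint16_t h) {
  const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

Widening to F32 is a plain shift, so the cost sits almost entirely in the narrowing direction, where the rounding happens.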
Edited by Chip Kerchner