Performance improvements in GEMM for Power

Added vector_pair loads for the LHS of GEMM for MMA (10% faster).

An extra accumulator for extra_row of GEMM for MMA & VSX (the non-vectorized right portion of the matrix executes essentially for free in almost all cases: 1200% / number of columns eliminated).

Single pass for extra_col of GEMM for VSX (the bottom of the matrix executes in a single pass versus up to 3 passes: 2400% / number of rows faster).

Other minor performance changes.
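To illustrate the single-pass idea, here is a minimal scalar sketch. It is not the Eigen kernel: the function names, the column-major layout, and the blocking of 4 (so at most 3 leftover columns) are assumptions made for the example. The baseline walks the depth dimension once per leftover column; the single-pass variant walks it once and keeps one accumulator per leftover column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; not Eigen's actual Power kernel.
// All matrices are column-major: A is rows x depth, B is depth x cols,
// C is rows x cols. Columns [first_col, cols) are the leftovers.

// Baseline: every leftover column triggers its own pass over depth,
// so each element of A is re-read once per leftover column.
inline void leftover_cols_multi_pass(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::vector<double>& C,
                                     std::size_t rows, std::size_t depth,
                                     std::size_t cols, std::size_t first_col) {
  for (std::size_t j = first_col; j < cols; ++j) {
    for (std::size_t i = 0; i < rows; ++i) {
      double acc = 0.0;
      for (std::size_t k = 0; k < depth; ++k)
        acc += A[i + k * rows] * B[k + j * depth];
      C[i + j * rows] += acc;
    }
  }
}

// Single pass: one walk over depth per row, with one accumulator per
// leftover column (at most 3, assuming a column blocking of 4).
inline void leftover_cols_single_pass(const std::vector<double>& A,
                                      const std::vector<double>& B,
                                      std::vector<double>& C,
                                      std::size_t rows, std::size_t depth,
                                      std::size_t cols, std::size_t first_col) {
  const std::size_t rem = cols - first_col;
  assert(rem <= 3 && "sketch assumes at most 3 leftover columns");
  for (std::size_t i = 0; i < rows; ++i) {
    double acc[3] = {0.0, 0.0, 0.0};
    for (std::size_t k = 0; k < depth; ++k) {
      const double a = A[i + k * rows];  // loaded once, reused for all leftovers
      for (std::size_t c = 0; c < rem; ++c)
        acc[c] += a * B[k + (first_col + c) * depth];
    }
    for (std::size_t c = 0; c < rem; ++c)
      C[i + (first_col + c) * rows] += acc[c];
  }
}
```

Both functions add the same values to C. The difference is only in how often the LHS is traversed, which is where a saving of this kind would come from in a real vectorized kernel.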