AVX512 Optimizations for Triangular Solve

What does this implement/fix?

This MR adds optimized AVX512 kernels to improve fp32/fp64 Triangular Solve performance. These kernels are "nocopy" (i.e., matrices are not packed and GEBP is not used) and are intended to give better performance for smaller problem sizes (only inner strides of 1 are supported). The existing generic implementation of this functionality was pulled into separate functions, trsmKernelL/trsmKernelR, so that the optimized versions can be dropped in when needed in TriangularSolverMatrix.h.
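
As a quick orientation, here is a minimal usage sketch (a hypothetical example, not from the MR) of a solve that would route through the new nocopy path for a small/medium fp64 problem with inner stride 1:

```cpp
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(64, 128);
  A.diagonal().array() += 64.0;  // keep the system well-conditioned
  // Left, lower, non-transposed (LLN): solve A_lower * X = B in place.
  A.triangularView<Eigen::Lower>().solveInPlace(B);
  return 0;
}
```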

Changes:

  • Eigen/src/Core/products/TriangularSolverMatrix.h: Replaced the previous inner triangular-solve loop with a call to the extracted implementation (trsmKernelL, trsmKernelR).
  • Eigen/src/Core/arch/AVX512/trsmKernel_impl.hpp: The optimized kernels are implemented here, along with the fp32/fp64 template specializations of trsmKernelL and trsmKernelR.
    • The solve kernel, trisolve, solves AX = B, where A is an MxM triangular matrix and B is MxN. Both A and B can be row- or column-major, and A can be upper or lower triangular. Combinations of these layouts cover the cases where A is on the right (see the scalar reference sketch after this list).
    • gemm_MNK__ is used to update panels of B and computes C -= A*B (also sketched after the list). It can be reused for matrix-multiply optimizations (smaller sizes for certain transpose cases).
    • Both kernels use various unrolls, which are generated recursively using templates (illustrated below the list).
    • For small/medium sizes the solve kernel (when built with clang) is generally faster and is used directly for the entire problem. TODO: improve the heuristics for deciding when to use the kernels directly (the current cutoffs were determined from quick benchmarking).
    • Note: we have noticed increases in compile time as a result of these changes.
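
To make the layout discussion concrete, here is a hypothetical scalar reference (names like trisolve_ref and the leading-dimension parameters are illustrative, not from the MR) for what trisolve computes in the LLN case, with column-major A and B. The right-side cases reduce to this one: XA = B is equivalent to the left solve AᵀXᵀ = Bᵀ, which amounts to reinterpreting the storage orders.

```cpp
// Scalar reference: solve A * X = B in place by forward substitution.
// A is MxM lower triangular, B is MxN; both column-major with leading
// dimensions ldA/ldB. The AVX512 trisolve vectorizes across B's columns.
template <typename Scalar>
void trisolve_ref(const Scalar* A, Scalar* B, int M, int N, int ldA, int ldB) {
  for (int j = 0; j < N; ++j) {
    for (int i = 0; i < M; ++i) {
      Scalar s = B[i + j * ldB];
      for (int k = 0; k < i; ++k)
        s -= A[i + k * ldA] * B[k + j * ldB];  // subtract solved entries
      B[i + j * ldB] = s / A[i + i * ldA];     // divide by the diagonal
    }
  }
}
```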
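
Likewise, a scalar sketch (again with illustrative names) of the panel update the gemm kernel performs, C -= A*B; the real kernel holds a register tile of C and accumulates with FMAs, but the arithmetic is this:

```cpp
// Scalar reference for the panel update C -= A * B, with A MxK, B KxN,
// C MxN, all column-major here (the actual kernel supports the layouts
// trisolve needs).
template <typename Scalar>
void gemm_update_ref(const Scalar* A, const Scalar* B, Scalar* C,
                     int M, int N, int K, int ldA, int ldB, int ldC) {
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < M; ++i) {
      Scalar s = Scalar(0);
      for (int k = 0; k < K; ++k)
        s += A[i + k * ldA] * B[k + j * ldB];
      C[i + j * ldC] -= s;
    }
}
```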
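
Finally, a minimal sketch of the recursive-template unrolling pattern mentioned above (illustrative only; the MR's unrolls are specialized per kernel):

```cpp
#include <utility>

// Expands body(0), body(1), ..., body(N-1) at compile time; the
// recursion flattens into straight-line code.
template <int I, int N>
struct Unroll {
  template <typename Fn>
  static void run(Fn&& body) {
    body(I);
    Unroll<I + 1, N>::run(std::forward<Fn>(body));
  }
};

// Base case terminates the recursion.
template <int N>
struct Unroll<N, N> {
  template <typename Fn>
  static void run(Fn&&) {}
};

// Usage: a fully unrolled update of 4 accumulators.
// Unroll<0, 4>::run([&](int i) { acc[i] += a[i] * b; });
```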

Additional information

Below are performance results for fp32/fp64 triangular solve with the optimized kernels. The charts cover the RUN (right, upper, non-transposed) and LLN (left, lower, non-transposed) trsm cases. The metric is flops/cycle, measured on an Intel(R) Xeon(R) Gold 6336Y (peak: 64 flops/cycle for fp32, 32 flops/cycle for fp64). The compilers used were g++ 8.4.1 and clang++ 11.0.0.

For the RUN case, the data in the matrices are organized in the most favorable way (both A and B are row-major), so this gives the best performance of the 8 cases. For the LLN case we perform intermediate transposes, so performance is not as good. For large problem sizes, triangular solve performance is entirely dependent on GEBP performance. Currently, GNU compilers generate sub-optimal code for the gemm micro kernel: we see register spilling that is not present with clang (this is noted in comments in the code). This only impacts performance for smaller sizes; for larger sizes, performance with either compiler was similar.

(Performance charts: STRSM_RUN, STRSM_LLN, DTRSM_RUN, DTRSM_LLN)

+@aaraujom
