AVX512 Optimizations for Triangular Solve
What does this implement/fix?
This MR adds optimized AVX512 kernels to improve fp32/fp64 triangular solve performance. These kernels are "nocopy" (i.e., matrices are not packed and GEBP is not used) and are meant to improve performance for smaller problem sizes (only inner strides of 1 are supported). The existing generic implementation of this functionality was pulled out into separate functions, trsmKernelL/trsmKernelR, so that the optimized versions can be dropped in when needed in TriangularSolverMatrix.h.
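To make the refactoring concrete, here is a hedged sketch of the kind of interface involved, assuming simplified signatures and a plain scalar body (the real trsmKernelL/trsmKernelR live in Eigen's internals and take Eigen-specific types; this is not the optimized AVX512 code):

```cpp
#include <cstddef>

// Sketch only: a generic fallback for the "left, lower, non-transposed" case.
// Solves A * X = B in place, where A is M x M lower-triangular and B is
// M x N, both column-major with unit inner stride (X overwrites B).
// An architecture-specific specialization (e.g. the AVX512 fp32/fp64
// kernels) can be dropped in behind the same shape of interface.
template <typename Scalar>
void trsmKernelL(std::ptrdiff_t M, std::ptrdiff_t N,
                 const Scalar* A, std::ptrdiff_t lda,
                 Scalar* B, std::ptrdiff_t ldb) {
  for (std::ptrdiff_t j = 0; j < N; ++j) {        // each right-hand side
    Scalar* b = B + j * ldb;
    for (std::ptrdiff_t i = 0; i < M; ++i) {      // forward substitution
      const Scalar x = b[i] / A[i + i * lda];
      b[i] = x;
      for (std::ptrdiff_t k = i + 1; k < M; ++k)  // eliminate rows below i
        b[k] -= A[k + i * lda] * x;
    }
  }
}
```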
Changes:
- Eigen/src/Core/products/TriangularSolverMatrix.h: replaced the previous inner triangular solve loop with a wrapper around the original implementation (trsmKernelL, trsmKernelR).
- Eigen/src/Core/arch/AVX512/trsmKernel_impl.hpp: the optimized kernels are implemented here, along with the fp32/fp64 template specializations of trsmKernelL and trsmKernelR.
- The solve kernel, trisolve, solves A*X = B where A is MxM triangular and B is MxN. A and B can be row- or column-major, and A can be upper or lower triangular; combinations of these layouts handle the cases where A is on the right (see the reference sketch after this list).
- gemm_MNK is used to update panels of B and computes C -= A*B (also shown in the sketch below). It can be reused for matrix-multiply optimizations (smaller sizes for certain transpose cases).
- Both kernels use various unrolls that are generated recursively using templates (a minimal unroll illustration also follows the list).
- For small/medium sizes the solve kernel (built with clang) is generally faster and is used directly for the entire problem. TODO: improve the heuristics for deciding when to use the kernels directly (the current cutoffs were determined from quick benchmarking).
- Note: we have noticed increases in compile time as a result of these changes.
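For reference, the following is a hedged scalar sketch of what the two kernels compute (gemmMNK and trisolveRef are illustrative names, not the MR's exact signatures; column-major storage with unit inner stride is assumed, and A is taken to be lower-triangular):

```cpp
#include <cstddef>

// C -= A * B, with C (M x N), A (M x K), B (K x N): the update that
// gemm_MNK performs on the trailing panel of B during the solve.
template <typename Scalar>
void gemmMNK(std::ptrdiff_t M, std::ptrdiff_t N, std::ptrdiff_t K,
             const Scalar* A, std::ptrdiff_t lda,
             const Scalar* B, std::ptrdiff_t ldb,
             Scalar* C, std::ptrdiff_t ldc) {
  for (std::ptrdiff_t j = 0; j < N; ++j)
    for (std::ptrdiff_t k = 0; k < K; ++k)
      for (std::ptrdiff_t i = 0; i < M; ++i)
        C[i + j * ldc] -= A[i + k * lda] * B[k + j * ldb];
}

// Blocked lower-triangular solve A * X = B (X overwrites B): solve an
// m x m diagonal block against all N right-hand sides, then apply the
// C -= A*B update to the rows below it, and repeat down the diagonal.
template <typename Scalar>
void trisolveRef(std::ptrdiff_t M, std::ptrdiff_t N,
                 const Scalar* A, std::ptrdiff_t lda,
                 Scalar* B, std::ptrdiff_t ldb, std::ptrdiff_t bs = 8) {
  for (std::ptrdiff_t i0 = 0; i0 < M; i0 += bs) {
    const std::ptrdiff_t m = (i0 + bs < M) ? bs : (M - i0);
    // Forward substitution within the diagonal block.
    for (std::ptrdiff_t j = 0; j < N; ++j)
      for (std::ptrdiff_t i = 0; i < m; ++i) {
        Scalar s = B[i0 + i + j * ldb];
        for (std::ptrdiff_t k = 0; k < i; ++k)
          s -= A[(i0 + i) + (i0 + k) * lda] * B[i0 + k + j * ldb];
        B[i0 + i + j * ldb] = s / A[(i0 + i) + (i0 + i) * lda];
      }
    // Trailing update: B[i0+m:M, :] -= A[i0+m:M, i0:i0+m] * B[i0:i0+m, :].
    if (i0 + m < M)
      gemmMNK<Scalar>(M - i0 - m, N, m,
                      A + (i0 + m) + i0 * lda, lda,
                      B + i0, ldb,
                      B + (i0 + m), ldb);
  }
}
```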
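The recursive template unrolling mentioned above can be illustrated with a minimal, generic pattern (not the MR's actual unroll machinery): a compile-time recursion that emits a fully unrolled fixed-size AXPY-style update.

```cpp
// Compile-time loop: instantiating UnrollAxpy<0, N>::run(...) emits N
// straight-line updates with no loop overhead, which the compiler can
// then schedule and vectorize freely.
template <int I, int N>
struct UnrollAxpy {
  template <typename Scalar>
  static inline void run(Scalar a, const Scalar* x, Scalar* y) {
    y[I] -= a * x[I];                    // one unrolled step
    UnrollAxpy<I + 1, N>::run(a, x, y);  // recurse at compile time
  }
};

template <int N>
struct UnrollAxpy<N, N> {  // base case terminates the recursion
  template <typename Scalar>
  static inline void run(Scalar, const Scalar*, Scalar*) {}
};

// Usage: UnrollAxpy<0, 8>::run(a, x, y) expands to 8 unrolled updates.
```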
Additional information
Here are some performance results for fp32/fp64 triangular solve with the optimized kernels. The charts are for the RUN (right, upper, non-transposed) and LLN (left, lower, non-transposed) trsm cases. The metric is flops/cycle, measured on an Intel(R) Xeon(R) Gold 6336Y (peak is 64 flops/cycle for fp32 and 32 flops/cycle for fp64). The compilers used were g++ 8.4.1 and clang++ 11.0.0.
For the RUN case, the data in the matrices are organized in the most favorable way (both A and B are row-major), so this case gives the best performance of the 8 cases. For the LLN case, we do intermediate transposes, so performance is not as good. For large problem sizes, triangular solve performance is entirely dependent on GEBP performance. Currently, GNU compilers generate sub-optimal code for the gemm micro kernel: we see some register spilling that is not present with clang (this is mentioned in comments in the code). This only impacts performance for smaller sizes; for larger sizes, performance with either compiler was similar.