AVX512 TRSM kernels use alloca if EIGEN_NO_MALLOC requested
What does this implement/fix?
Follow-up PR to address comments in !992 (merged). In that PR, the LHS variants of the AVX512 TRSM kernels are disabled if EIGEN_NO_MALLOC is requested. In particular, one comment suggested using alloca instead of disabling those LHS variants completely.
This PR changes the behaviour as follows:
- If `EIGEN_NO_MALLOC` is requested:
  - If max temp workspace size using default blocking sizes is less than `EIGEN_STACK_ALLOCATION_LIMIT` then use `alloca`.
  - Otherwise, reduce blocking size up to the minimum supported then use `alloca` (perf. is still better than generic trsm kernel, see graph below).
  - If max temp workspace size using minimum blocking sizes is still larger than `EIGEN_STACK_ALLOCATION_LIMIT` then throw assertion.
- If `EIGEN_NO_MALLOC` is not requested we use `handmade_aligned_malloc`.
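
The decision logic above can be sketched roughly as follows. This is only an illustration and not the code in this PR: `WorkspacePlan`, `planWorkspace`, `workspaceBytes`, the linear workspace-size model, and the halving step used to shrink the blocking size are all assumptions made for the example. The `stackLimit` parameter stands in for `EIGEN_STACK_ALLOCATION_LIMIT`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; none of these exist in Eigen.
enum class WorkspaceSource { Heap, Stack };

struct WorkspacePlan {
  WorkspaceSource source;  // heap (malloc-style) or stack (alloca-style)
  std::size_t blockSize;   // blocking size that will actually be used
  std::size_t bytes;       // temp workspace size for that blocking size
};

// Assumed cost model: workspace grows linearly with the blocking size.
inline std::size_t workspaceBytes(std::size_t blockSize,
                                  std::size_t bytesPerBlockUnit) {
  return blockSize * bytesPerBlockUnit;
}

inline WorkspacePlan planWorkspace(bool noMalloc, std::size_t defaultBlock,
                                   std::size_t minBlock,
                                   std::size_t bytesPerBlockUnit,
                                   std::size_t stackLimit) {
  if (!noMalloc) {
    // Heap allocation is allowed: keep the default blocking sizes.
    return {WorkspaceSource::Heap, defaultBlock,
            workspaceBytes(defaultBlock, bytesPerBlockUnit)};
  }

  // No heap allocation: shrink the blocking size until the workspace fits
  // under the stack limit, but never go below the minimum supported size.
  // Halving is an assumption; the real kernels may shrink differently.
  std::size_t block = defaultBlock;
  while (block > minBlock &&
         workspaceBytes(block, bytesPerBlockUnit) > stackLimit) {
    block = std::max(minBlock, block / 2);
  }

  const std::size_t bytes = workspaceBytes(block, bytesPerBlockUnit);
  // Even the minimum blocking size does not fit: give up with an assertion.
  assert(bytes <= stackLimit && "temp workspace exceeds stack allocation limit");
  return {WorkspaceSource::Stack, block, bytes};
}
```

A caller would then allocate `bytes` with `alloca` when `source` is `Stack` and with an aligned heap allocator otherwise. The `alloca` call has to be made in the frame that uses the buffer, which is why the sketch only returns a plan rather than a pointer.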
Additional information
There is a noticeable performance hit (see graph below) when using alloca vs malloc, so malloc is still used if allowed.
- Non-optimized: generic trsm kernels, code-path used when `EIGEN_NO_MALLOC` is requested (behaviour as of !992 (merged)).
- Min-blocking: AVX512 trsm kernels with minimum required blocking sizes + `alloca`.
- Default-blocking: AVX512 trsm kernels with default blocking sizes + `alloca`.
- Malloc: Default-blocking: AVX512 trsm kernels with default blocking sizes + `malloc`.
