AVX512 TRSM kernels use alloca if EIGEN_NO_MALLOC requested

What does this implement/fix?

Follow-up PR to address review comments in !992 (merged). That PR disabled the LHS variants of the AVX512 TRSM kernels whenever EIGEN_NO_MALLOC is requested. A review comment there suggested using alloca instead of disabling those kernels entirely.

This PR changes the behaviour as follows:

  • If EIGEN_NO_MALLOC is requested:
    • If the maximum temporary workspace size with the default blocking sizes is less than EIGEN_STACK_ALLOCATION_LIMIT, use alloca.
    • Otherwise, reduce the blocking sizes, down to the minimum supported, and then use alloca (performance is still better than the generic trsm kernel, see graph below).
    • If the maximum temporary workspace size with the minimum blocking sizes is still larger than EIGEN_STACK_ALLOCATION_LIMIT, trigger an assertion.
  • If EIGEN_NO_MALLOC is not requested, use handmade_aligned_malloc.
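
The decision rules above can be sketched roughly as follows. This is an illustration only, not the code in this PR: planWorkspace, Plan, Alloc and workspaceBytes are hypothetical names, the linear workspace model and the halving strategy are assumptions, and the stack limit is passed as a parameter standing in for EIGEN_STACK_ALLOCATION_LIMIT. Treating a workspace exactly equal to the limit as fitting is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Where the temporary workspace would come from (hypothetical enum).
enum class Alloc { Heap, Stack, Unsupported };

struct Plan {
  Alloc alloc;           // allocation strategy selected
  std::size_t blocking;  // blocking size selected (0 if unsupported)
};

// Hypothetical workspace model: bytes grow linearly with the blocking size.
constexpr std::size_t workspaceBytes(std::size_t blocking, std::size_t bytesPerUnit) {
  return blocking * bytesPerUnit;
}

// stackLimit stands in for EIGEN_STACK_ALLOCATION_LIMIT (assumption).
Plan planWorkspace(bool noMalloc, std::size_t defaultBlocking, std::size_t minBlocking,
                   std::size_t bytesPerUnit, std::size_t stackLimit) {
  // Heap allocation is preferred whenever it is allowed.
  if (!noMalloc) return {Alloc::Heap, defaultBlocking};

  // Shrink the blocking size (here: by halving, an assumption) until the
  // workspace fits on the stack or the minimum blocking size is reached.
  std::size_t blocking = defaultBlocking;
  while (blocking > minBlocking && workspaceBytes(blocking, bytesPerUnit) > stackLimit) {
    blocking = std::max(minBlocking, blocking / 2);
  }

  if (workspaceBytes(blocking, bytesPerUnit) <= stackLimit) return {Alloc::Stack, blocking};

  // Even the minimum blocking size does not fit: the caller would assert here.
  return {Alloc::Unsupported, 0};
}
```

A caller following the rules above would assert that the returned plan is not Alloc::Unsupported before taking the stack path.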

Additional information

There is a noticeable performance hit (see graph below) when using alloca vs malloc, so malloc is still used if allowed.

[Graph: STRSM_LLN_Performance_Comparison]

  • Non-optimized: generic trsm kernels, code-path used when EIGEN_NO_MALLOC is requested (behaviour as of !992 (merged))
  • Min-blocking: AVX512 trsm kernels with minimum required blocking sizes + alloca.
  • Default-blocking: AVX512 trsm kernels with default blocking sizes + alloca.
  • Malloc: AVX512 trsm kernels with default blocking sizes + malloc.
