Add inline hint to help Clang and NVHPC gain performance
Describe the feature you would like to be implemented.
I've been running the various benchmarks under bench/btl. I've noticed that Clang and NVHPC can gain roughly 10-20% on some of them for smaller problem sizes (the lower end of the problem size range that each benchmark covers by default) if I add __attribute__((always_inline)) to some of the critical functions. This essentially forces LLVM to inline calls to those functions.
Some of the functions that I found this helps are
The benchmarks where I found this helps include ata, axpby, axpy, cholesky, matrix_matrix, partial_lu_decomp, syr2, tridiagonalization, trisolve_matrix and trisolve_vector.
Why would such a feature be useful for other users?
Everyone who uses either of those two compilers and works on small problem sizes would get a speedup.
Any hints on how to implement the requested feature?
One way to do this is to update the definition of EIGEN_STRONG_INLINE here by adding
+#elif (EIGEN_COMP_CLANG || EIGEN_COMP_NVHPC)
+#define EIGEN_STRONG_INLINE __attribute__((always_inline)) inline
This does change the behavior of both EIGEN_STRONG_INLINE and EIGEN_ALWAYS_INLINE for those two compilers, but it leaves the behavior of all other compilers unchanged.
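To make the shape of the change concrete, here is a minimal standalone sketch of the same idea outside of Eigen. The DEMO_* macros and the demo_madd / demo_axpy functions are hypothetical names used only for illustration. The compiler detection via __clang__ and __NVCOMPILER is my assumption about how the two compilers can be identified, and the surrounding #if chain is simplified rather than copied from Eigen's actual definition.

```cpp
#include <cstddef>

// Hypothetical stand-ins for compiler-detection macros (not Eigen's own).
#if defined(__clang__)
#define DEMO_COMP_CLANG 1
#else
#define DEMO_COMP_CLANG 0
#endif

#if defined(__NVCOMPILER)
#define DEMO_COMP_NVHPC 1
#else
#define DEMO_COMP_NVHPC 0
#endif

// Simplified inline-hint macro with the proposed extra branch.
#if defined(_MSC_VER)
#define DEMO_STRONG_INLINE __forceinline
#elif (DEMO_COMP_CLANG || DEMO_COMP_NVHPC)
// Proposed branch: ask the LLVM-based compilers to always inline.
#define DEMO_STRONG_INLINE __attribute__((always_inline)) inline
#else
// Every other compiler keeps the plain inline hint.
#define DEMO_STRONG_INLINE inline
#endif

// Small helper of the kind that sits on a hot path.
DEMO_STRONG_INLINE double demo_madd(double a, double x, double y) {
  return a * x + y;
}

// y[i] = a * x[i] + y[i]; the helper call is what we want inlined.
inline void demo_axpy(double a, const double* x, double* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = demo_madd(a, x[i], y[i]);
  }
}
```

The attribute only affects whether the helper is inlined, not what it computes, so results are identical on every compiler; only the generated code for the loop may differ.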
Additional resources
I have benchmark results with the above change that I can share.