Add inline hint to help Clang and NVHPC gain performance

Describe the feature you would like to be implemented.

I've been running the benchmarks under bench/btl, and I've noticed that Clang and NVHPC can get a 10-20% speedup on some of them at smaller problem sizes (the low end of the problem size range that each benchmark covers by default) if I add __attribute__((always_inline)) to some of the performance-critical functions. This basically forces LLVM to inline calls to those functions instead of leaving the decision to its inlining heuristics.

Some of the functions where I found this helps are

here
here

Some of the benchmarks where I found this helps are ata, axpby, axpy, cholesky, matrix_matrix, partial_lu_decomp, syr2, tridiagonalization, trisolve_matrix and trisolve_vector.

Why would such a feature be useful for other users?

Everyone who uses those two compilers and works on small problem sizes would get a speedup.

Any hints on how to implement the requested feature?

One way to do this is to update the definition of EIGEN_STRONG_INLINE here by adding a branch for the two compilers:

+#elif (EIGEN_COMP_CLANG || EIGEN_COMP_NVHPC)
+#define EIGEN_STRONG_INLINE __attribute__((always_inline)) inline

This changes the behavior of both EIGEN_STRONG_INLINE and EIGEN_ALWAYS_INLINE for these two compilers, but leaves the behavior of all other compilers unchanged.
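
To illustrate the idea outside of Eigen, here is a minimal standalone sketch of the same pattern. The names DEMO_STRONG_INLINE, demo_axpy_step and demo_axpy are hypothetical and are not part of Eigen. The sketch assumes that __clang__ and __NVCOMPILER are the predefined macros identifying Clang and NVHPC, and that _MSC_VER identifies MSVC. It is not the actual contents of Eigen's macro header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for EIGEN_STRONG_INLINE (not Eigen's real definition).
#if defined(_MSC_VER)
// MSVC-style forced inlining.
#define DEMO_STRONG_INLINE __forceinline
#elif defined(__clang__) || defined(__NVCOMPILER)
// Proposed branch: force inlining on Clang and NVHPC.
#define DEMO_STRONG_INLINE __attribute__((always_inline)) inline
#else
// Every other compiler keeps a plain inline hint.
#define DEMO_STRONG_INLINE inline
#endif

// Small hot helper of the kind that benefits from forced inlining
// when the surrounding loop runs over a small problem size.
DEMO_STRONG_INLINE double demo_axpy_step(double a, double x, double y) {
  return a * x + y;
}

// y = a * x + y, element by element.
void demo_axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = demo_axpy_step(a, x[i], y[i]);
  }
}
```

The macro only changes how the compiler treats the call, not the result of the computation, so any difference should show up only in the generated code and in timings.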

Additional resources

I have benchmark results with the above change that I can share.