Add AVX512 optimizations for matrix multiply

I've refactored the original implementation in this merge request to use packet math and avoid inline asm/intrinsics as much as possible. It also supports double precision.

The changes implement optimizations for the compute kernels using 48x8 and 24x8 unrolls for single and double precision, respectively. Tail handling is done with powers of 2 when possible, that is, when the packing routines (gemm_pack_rhs and gemm_pack_lhs) support it. If a block size is not supported, we loop one row at a time as before.
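To make the unroll shape concrete, here is a minimal sketch of the 48x8 single-precision micro-kernel layout written with Eigen's packet primitives. This is only an illustration of the register layout, not the MR's actual kernel: the function name and the packed panel layouts are my assumptions, and it requires an AVX512-enabled build (e.g. `-march=skylake-avx512`).

```cpp
#include <Eigen/Core>

using namespace Eigen::internal;

// Illustrative 48x8 sgemm micro-kernel: 3 Packet16f rows x 8 columns
// = 24 zmm accumulators, 3 registers for loading A plus broadcasts of B.
// Assumes A is packed in 48-row panels and B in 8-column panels
// (hypothetical layout). The real kernel fully unrolls these loops so
// the accumulators stay in zmm registers.
void micro_kernel_48x8(const float* A, const float* B, float* C,
                       long k, long ldc) {
  Packet16f acc[3][8];
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 8; ++j) acc[i][j] = pset1<Packet16f>(0.0f);

  for (long p = 0; p < k; ++p) {
    Packet16f a0 = ploadu<Packet16f>(A + 48 * p + 0);   // 3 A registers
    Packet16f a1 = ploadu<Packet16f>(A + 48 * p + 16);
    Packet16f a2 = ploadu<Packet16f>(A + 48 * p + 32);
    for (int j = 0; j < 8; ++j) {
      Packet16f b = pset1<Packet16f>(B[8 * p + j]);     // broadcast B
      acc[0][j] = pmadd(a0, b, acc[0][j]);
      acc[1][j] = pmadd(a1, b, acc[1][j]);
      acc[2][j] = pmadd(a2, b, acc[2][j]);
    }
  }
  for (int j = 0; j < 8; ++j) {                         // C += A * B
    pstoreu(C + j * ldc + 0,  padd(ploadu<Packet16f>(C + j * ldc + 0),  acc[0][j]));
    pstoreu(C + j * ldc + 16, padd(ploadu<Packet16f>(C + j * ldc + 16), acc[1][j]));
    pstoreu(C + j * ldc + 32, padd(ploadu<Packet16f>(C + j * ldc + 32), acc[2][j]));
  }
}
```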

The new kernels do not support an inner stride different from one for the C matrix; in that case we fall back to Eigen's previously used kernels (with nr == 4). We need to make this decision at the gebp_traits stage so that all kernels remain compatible, avoiding more intrusive changes in the other Eigen drivers that use gebp_kernel, gemm_pack_rhs and gemm_pack_lhs.
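As an illustration of why the decision has to live at the traits level, here is a simplified sketch (the struct and fallback macro names are mine, not Eigen's actual gebp_traits): if the packing routines and the kernel all read nr from the same place, they can never disagree on the panel width.

```cpp
// Simplified, hypothetical stand-in for the gebp_traits-level choice.
// All consumers (gebp_kernel, gemm_pack_lhs, gemm_pack_rhs) read nr
// from here, so swapping the kernel cannot desynchronize the packing.
struct my_gebp_traits {
#if defined(EIGEN_VECTORIZE_AVX512) && !defined(MY_FORCE_NR4_FALLBACK)
  static constexpr int nr = 8;  // new AVX512 kernels (unit C inner stride)
#else
  static constexpr int nr = 4;  // previously used kernels
#endif
};
```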

I've also added a couple of macros to reduce register pressure, which is very high: 24 accumulators plus 6 registers for loading A and 2 for loading B. Defining EIGEN_ARCH_AVX512_GEMM_KERNEL_USE_LESS_A_REGS or EIGEN_ARCH_AVX512_GEMM_KERNEL_USE_LESS_B_REGS halves the register use for A or B, respectively. The performance loss was small (less than 2% for large sizes). For gcc we use 3 registers to load A by default, since that was the only way I was able to avoid zmm register spills.
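For completeness, this is how I'd expect the knobs to be used; the macro names are from this MR, but I'm assuming they are picked up at include time like other Eigen configuration macros:

```cpp
// Opt into the reduced-register variants before including Eigen.
#define EIGEN_ARCH_AVX512_GEMM_KERNEL_USE_LESS_A_REGS
#define EIGEN_ARCH_AVX512_GEMM_KERNEL_USE_LESS_B_REGS
#include <Eigen/Dense>
```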

I've built the tests and run them for the following architectures: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512, and AVX512DQ. The other ones still need to be checked.

Performance

I've done a simple sweep test over square problem sizes on a Xeon 8180 (Skylake) in sequential mode; gcc11 and clang11 were used to compile the benchmark code. There are speedups for dgemm (~20%) and sgemm (~15%), but for sgemm there can be some slowdowns for small problem sizes. More details on the performance are below. I've also measured the other transpose cases (NT, TN and TT), but the results are similar.
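For reference, here is a minimal sketch of the kind of sweep this describes. The exact harness is not part of this MR; the size range, timing method, and single run per size are my assumptions. Build without OpenMP for sequential mode.

```cpp
#include <Eigen/Dense>
#include <chrono>
#include <cstdio>

int main() {
  for (int n = 256; n <= 4096; n *= 2) {   // square sizes (step assumed)
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd C(n, n);
    auto t0 = std::chrono::steady_clock::now();
    C.noalias() = A * B;                   // dgemm path under test
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("n=%5d  %7.2f GFLOP/s\n", n, 2.0 * n * n * n / sec / 1e9);
  }
}
```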

@b-shi also saw improvements for trsm with this patch.

dgemm

Performance improvements for "dgemm" seem reasonable, around ~20% when compared to Eigen before the changes (c38f91d7). Some small sizes also improved, at least when they are multiples of 2.

(Plots: plot_eigen_new_clang_vs_eigen_old_clang_d_NN_clang_2, plot_eigen_new_gcc_vs_eigen_old_gcc_d_NN_gcc_2)

For smaller sizes I didn't see very large regressions.

(Plots: plot_eigen_new_clang_vs_eigen_old_clang_d_NN_clang_1, plot_eigen_new_gcc_vs_eigen_old_gcc_d_NN_gcc_1)

sgemm

For "sgemm", I see some speedups as well (up to 15% for clang and a bit more for gcc), but I've also notice some slow downs depending if the size is a multiple of 4 or not and for small sizes. plot_eigen_new_clang_vs_eigen_old_clang_s_NN_clang_4 plot_eigen_new_gcc_vs_eigen_old_gcc_s_NN_gcc_4

As mentioned before, non-multiples of 4 have some regressions for smaller problem sizes. Hopefully the benefits outweigh the regressions enough to make these changes worth it.

(Plots: plot_eigen_new_clang_vs_eigen_old_clang_s_NN_clang_1, plot_eigen_new_gcc_vs_eigen_old_gcc_s_NN_gcc_1)

For reference, here is the performance for multiples of 2 (step = 2):

(Plots: plot_eigen_new_clang_vs_eigen_old_clang_s_NN_clang_2, plot_eigen_new_gcc_vs_eigen_old_gcc_s_NN_gcc_2)

These slowdowns can probably be mitigated by further extending the packing to handle the tail with m = 2 for single precision directly, instead of looping one row at a time. For example, for m = 47 the tail handling would be 32 + 8 + 4 + 2 + 1 instead of 32 + 8 + 4 + 1 + 1 + 1.
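As a minimal sketch of the greedy power-of-2 split (my own illustration, not the packing code; min_block stands for the smallest block the packing supports beyond single rows):

```cpp
#include <cstdio>

// Peel the largest fitting power-of-2 block from the tail; blocks
// smaller than min_block (but larger than 1) fall back to single
// rows, mirroring the current packing limitation.
void print_tail_split(int m, int min_block) {
  int b = 32;
  while (m > 0) {
    while (b > m) b /= 2;                           // largest fitting power of 2
    int step = (b >= min_block || b == 1) ? b : 1;  // unsupported -> single rows
    std::printf("%d ", step);
    m -= step;
  }
  std::printf("\n");
}

int main() {
  print_tail_split(47, 2);  // 32 8 4 2 1
  print_tail_split(47, 4);  // 32 8 4 1 1 1
}
```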

Old stuff before the large refactor.

What does this implement/fix?

This implements/adds some optimizations for the "sgemm" compute kernels for AVX512. This is still a work in progress, since it doesn't use packet math yet. However, it should be useful for getting some early feedback on the changes.

Here are some comments/questions worth mentioning:

  1. It would be useful to know if the inline asm used in the kernel (Eigen/src/Core/arch/AVX512/sgemm_kern.hpp) is reasonable/acceptable. In particular, some register mapping was used to avoid gcc register spills. Also, using inline asm for loading A/B elements results in better performance with gcc.

  2. The kernel is quite verbose, but it should be buildable with c++14. It manually unrolls and handles tails with powers of 2, except for m/n equal to 2, where I had to loop one at a time so I could reuse the packing kernels. Does it make sense to rewrite the kernel with c++17 as @b-shi did in !834 (merged)? (A rough sketch of that style follows the list.)

  3. I'm not really sure if the performance improvements justify the changes; Eigen's existing "sgemm" performance is quite good. I see about a 10% to 15% performance increase for large sizes with the changes. Maybe we will need some threshold to dispatch the new kernel for large sizes only, to avoid regressions for smaller sizes. Is this acceptable?

  4. I tried to re-enable packing with nr = 8 for gemm_pack_rhs by uncommenting it and making small changes. It seems to work for matrix multiplication, but I'm not sure if it was commented out for other reasons. Was there a reason?
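Regarding point 2, here is a rough sketch, under my own naming and with a trivial stand-in kernel body, of the C++17 style used in !834: `if constexpr` recursion generates all the power-of-2 tail cases from a single template instead of hand-written copies.

```cpp
#include <cstdio>

template <int M>
void tail_kernel(long row) {  // stand-in for a real M-row micro-kernel
  std::printf("rows [%ld, %ld) handled by %d-row kernel\n", row, row + M, M);
}

// m is the tail remainder after full panels (assumed < 64 here).
template <int M = 32>
void handle_tail(long m, long row = 0) {
  if (m & M) {                // this power-of-2 block is present in m
    tail_kernel<M>(row);
    row += M;
  }
  if constexpr (M > 1) handle_tail<M / 2>(m, row);
}

int main() { handle_tail(47); }  // 32-, 8-, 4-, 2- and 1-row kernels
```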

Additional information

Here are some initial performance measurements on an Intel(R) Xeon(R) Platinum 8180 for A/B non-transpose. For clang we can actually remove the register mapping without getting register spills, but gcc performance would be lower.

NN using gcc11

(Plot: plot_Eigen+opt_vs_Eigen-3.4.0_gcc_res4x4x4)

NN using clang11

(Plot: plot_Eigen+opt_vs_Eigen-3.4.0_clang_res4x4x4)

+@b-shi
