Implement faster GEMM kernel for AVX512

Submitted by wil..@..il.com

Assigned to Nobody

Link to original bugzilla bug (#1642)
Version: 3.4 (development)
Platform: x86 - AVX

Description

Created attachment 907
bench_matrix_vs_tensor.cpp

Good morning all,
I'm benchmarking Eigen master matmul on AWS C5:
https://aws.amazon.com/ec2/instance-types/c5/
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat

The speedup of AVX512F over AVX/AVX2 is not that good:

ubuntu@ip-172-30-0-228:/eigen-eigen-cf697272/unsupported/bench$ ./buildrun.sh "-mavx2 -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu116.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK EMatrix ETensor
256 0.00510111 0.00469228
512 0.0387391 0.0315717
1024 0.272151 0.263714
2048 2.24269 2.22049

ubuntu@ip-172-30-0-228:/eigen-eigen-cf697272/unsupported/bench$ ./buildrun.sh "-mavx512f -mavx512cd -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu116.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK EMatrix ETensor
256 0.00408185 0.00253315
512 0.0228611 0.0203946
1024 0.194071 0.170996
2048 1.52432 1.51736

Is this expected?
Kind
WT

Depends on

#1633 (closed)

Blocking

#1608

Edited Dec 05, 2019 by Eigen Bugzilla
