Implement faster GEMM kernel for AVX512
Submitted by wil..@..il.com
Assigned to Nobody
Link to original bugzilla bug (#1642)
Version: 3.4 (development)
Platform: x86 - AVX
Description
Created attachment 907
bench_matrix_vs_tensor.cpp
Good morning all,
I'm benchmarking Eigen master's matrix multiplication on an AWS C5 instance:
https://aws.amazon.com/ec2/instance-types/c5/
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
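The flags line above lists avx512f and avx512cd, and the two runs below pass different -m flags to buildrun.sh. As a quick sanity check on such a machine, a minimal C++ sketch like the following can report two separate things: which extensions the compiler was told to target, using the __AVX__, __AVX2__, __FMA__ and __AVX512F__ macros that GCC and Clang predefine for the matching -m flags, and whether the running CPU supports AVX512F, using the GCC/Clang builtin __builtin_cpu_supports. The helper names (compiled_isa, cpu_has_avx512f, check_flags_consistent) are hypothetical and are not part of Eigen or of the attached benchmark.

```cpp
#include <cassert>
#include <string>

// Compile-time view: GCC and Clang predefine these macros when the matching
// -m flags (-mavx, -mavx2, -mfma, -mavx512f) are in effect for this file.
std::string compiled_isa() {
  std::string s;
#if defined(__AVX512F__)
  s += "AVX512F ";
#endif
#if defined(__AVX2__)
  s += "AVX2 ";
#endif
#if defined(__FMA__)
  s += "FMA ";
#endif
#if defined(__AVX__)
  s += "AVX ";
#endif
  return s.empty() ? std::string("none of AVX/AVX2/FMA/AVX512F") : s;
}

// Runtime view: whether the CPU running the binary reports AVX512F,
// the feature listed as "avx512f" in /proc/cpuinfo.
bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;  // builtin not available for this compiler/target
#endif
}

// GCC and Clang treat -mavx512f as implying AVX2, so a build that defines
// __AVX512F__ should also report AVX2.
void check_flags_consistent() {
#if defined(__AVX512F__)
  assert(compiled_isa().find("AVX2") != std::string::npos);
#endif
}
```

Called from main() in a binary built with "-mavx512f -mavx512cd -mfma", compiled_isa() should include AVX512F, and on this C5 instance cpu_has_avx512f() should return true. Eigen's own view of what it vectorizes with is the "Simd:" line printed by the benchmark.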
The speedup of the AVX512F build over the AVX2/FMA build is smaller than I expected:
ubuntu@ip-172-30-0-228:/eigen-eigen-cf697272/unsupported/bench$ ./buildrun.sh "-mavx2 -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK EMatrix ETensor
256 0.00510111 0.00469228
512 0.0387391 0.0315717
1024 0.272151 0.263714
2048 2.24269 2.22049
ubuntu@ip-172-30-0-228:/eigen-eigen-cf697272/unsupported/bench$ ./buildrun.sh "-mavx512f -mavx512cd -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK EMatrix ETensor
256 0.00408185 0.00253315
512 0.0228611 0.0203946
1024 0.194071 0.170996
2048 1.52432 1.51736
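To quantify the gap, the ratio (AVX2 time) / (AVX512 time) can be computed directly from the two tables, with 2.0 as a naive reference. The short C++ sketch below does only that arithmetic on the numbers reported above; the Row struct, the arrays and speedup() are illustrative names, not code from the attached benchmark, and no assumption is made about the time unit since it cancels in the ratio.

```cpp
#include <array>
#include <cassert>

// One benchmark size with its timings from the two runs above.
struct Row {
  int n;          // M = N = K
  double avx2;    // build with "-mavx2 -mfma"
  double avx512;  // build with "-mavx512f -mavx512cd -mfma"
};

constexpr std::array<Row, 4> kMatrix = {{  // EMatrix column
    {256, 0.00510111, 0.00408185},
    {512, 0.0387391, 0.0228611},
    {1024, 0.272151, 0.194071},
    {2048, 2.24269, 1.52432},
}};

constexpr std::array<Row, 4> kTensor = {{  // ETensor column
    {256, 0.00469228, 0.00253315},
    {512, 0.0315717, 0.0203946},
    {1024, 0.263714, 0.170996},
    {2048, 2.22049, 1.51736},
}};

// Speedup of the AVX512 build over the AVX2 build. Going from 256-bit to
// 512-bit vectors doubles the lanes, so 2.0 is the naive upper reference.
double speedup(const Row& r) {
  assert(r.avx512 > 0.0);
  return r.avx2 / r.avx512;
}
```

Evaluating speedup() for each row gives roughly 1.25x (256) up to 1.69x (512) for EMatrix, and roughly 1.46x (2048) up to 1.85x (256) for ETensor.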
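Depending on the size, the AVX512 build is only about 1.25x to 1.85x faster, short of the roughly 2x one might hope for from doubling the vector width from 256 to 512 bits. Is this expected?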
Kind regards,
WT