Use 3px8/2px8/1px8/1x8 gebp_kernel on arm64-neon

Reference issue

#2518 (closed)

What does this implement/fix?

I found that the gebp_kernel Eigen uses on NEON defaults to 3px4/2px4/1px4. This is reasonable on x86 (AVX/FMA) and arm32. However, arm64 NEON has 32 vector registers, so a larger gebp_kernel can be used to get better data reuse and improve performance.
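To make the data-reuse argument concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified cost model that is not taken from Eigen's source: per depth step, a kernel loads one vector per lhs packet and one value per rhs column, and issues one FMA per (lhs packet, rhs column) pair. The names `KernelShape`, `fmas_per_step`, and `loads_per_step` are hypothetical.

```cpp
// Hypothetical cost model for one depth step of an (lhs_packets x nr) micro-kernel.
// This is an illustration only, not Eigen's actual instruction schedule.
struct KernelShape {
  int lhs_packets;  // the "3p" in 3px8: vector registers holding lhs rows
  int nr;           // the "x8" in 3px8: number of rhs columns
};

// One FMA per (lhs packet, rhs column) pair.
constexpr int fmas_per_step(KernelShape s) { return s.lhs_packets * s.nr; }

// Assumed loads: one vector load per lhs packet plus one value per rhs column.
constexpr int loads_per_step(KernelShape s) { return s.lhs_packets + s.nr; }

constexpr KernelShape k3px4{3, 4};
constexpr KernelShape k3px8{3, 8};

// 3px4: 12 FMAs for 7 loads; 3px8: 24 FMAs for 11 loads.
static_assert(fmas_per_step(k3px4) == 12 && loads_per_step(k3px4) == 7, "3px4");
static_assert(fmas_per_step(k3px8) == 24 && loads_per_step(k3px8) == 11, "3px8");

// Cross-multiplied comparison of FMAs per load: 24/11 > 12/7.
static_assert(fmas_per_step(k3px8) * loads_per_step(k3px4) >
                  fmas_per_step(k3px4) * loads_per_step(k3px8),
              "3px8 does more FMAs per load under this model");
```

Under this model the wider kernel performs about 2.2 FMAs per load versus about 1.7 for 3px4. Real gains depend on the actual instruction scheduling and memory hierarchy.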

Therefore, I implemented 3px8/2px8/1px8 gebp_kernel variants in Eigen. The 3px8 kernel uses 24 registers for accumulators, 3 for lhs, and 1 for rhs.
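For readers unfamiliar with the 3px8 shape, the following is a minimal portable sketch of such a micro-kernel for float, assuming a 4-lane packet. It is not the code from this merge request: `Packet`, `fma_broadcast`, `micro_kernel_3px8`, and the packed layouts are hypothetical stand-ins, and a real NEON kernel would use intrinsics and Eigen's own packing routines.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for one 128-bit NEON register holding 4 floats.
using Packet = std::array<float, 4>;

constexpr int kLhsPackets = 3;        // "3p": three lhs packets per depth step
constexpr int kNr = 8;                // "x8": eight rhs columns
constexpr int kMr = kLhsPackets * 4;  // 12 rows per micro-tile

// Register budget from the description: 24 acc + 3 lhs + 1 rhs = 28 of 32.
static_assert(kLhsPackets * kNr + kLhsPackets + 1 <= 32,
              "tile must fit in the 32 arm64 NEON vector registers");

// acc += a * b on every lane, modelling an FMA with a broadcast rhs value.
inline void fma_broadcast(Packet& acc, const Packet& a, float b) {
  for (int l = 0; l < 4; ++l) acc[l] += a[l] * b;
}

// Computes the 12x8 tile C += A * B.
// packed_a: for each depth step, 12 contiguous lhs values (assumed layout).
// packed_b: for each depth step, 8 contiguous rhs values (assumed layout).
// c: column-major 12x8 tile with leading dimension ldc.
void micro_kernel_3px8(const float* packed_a, const float* packed_b,
                       std::size_t depth, float* c, std::size_t ldc) {
  Packet acc[kLhsPackets][kNr] = {};  // 24 zero-initialized accumulators
  for (std::size_t k = 0; k < depth; ++k) {
    Packet a[kLhsPackets];  // 3 lhs registers, reloaded every depth step
    for (int p = 0; p < kLhsPackets; ++p)
      for (int l = 0; l < 4; ++l) a[p][l] = packed_a[k * kMr + p * 4 + l];
    for (int j = 0; j < kNr; ++j) {
      const float b = packed_b[k * kNr + j];  // 1 rhs value feeds 3 FMAs
      for (int p = 0; p < kLhsPackets; ++p) fma_broadcast(acc[p][j], a[p], b);
    }
  }
  // Write the accumulated tile back into C.
  for (int j = 0; j < kNr; ++j)
    for (int p = 0; p < kLhsPackets; ++p)
      for (int l = 0; l < 4; ++l) c[j * ldc + p * 4 + l] += acc[p][j][l];
}
```

The 2px8 and 1px8 shapes follow the same pattern with 16 and 8 accumulators respectively.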

Additional information

benchmark

clang

dgemm

clang11_dgemm_step1

clang11_dgemm_step4

sgemm

clang11_sgemm_step1

clang11_sgemm_step4

hgemm

clang11_hgemm_step1

clang11_hgemm_step4

gcc

dgemm

gcc11_dgemm_step1

gcc11_dgemm_step4

sgemm

gcc11_sgemm_step1

gcc11_sgemm_step4

hgemm

gcc11_hgemm_step1

gcc11_hgemm_step4

Platform: Ampere® Altra
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: r3p1
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 8 MiB (128 instances)
L1i: 8 MiB (128 instances)
L2: 128 MiB (128 instances)

Edited by Lianhuang Li
