Add reciprocal packet op and fast specializations for float with SSE, AVX, and AVX512.

Add a reciprocal packet op with fast float specializations for SSE, AVX, and AVX512, all of which have built-in instructions for approximate reciprocal. The approximation is refined by one step of Newton-Raphson iteration. The result is accurate to within 2 ulps for SSE/AVX and within 1 ulp for AVX512, where the _mm512_rcp14_ps instruction provides a better starting guess.

TODO: Add specializations for more ISAs with fast approximate reciprocal instructions.

Benchmark numbers measured on Intel Xeon Gold 6154 (Skylake):

AVX512 (packet size 16):
name                         old cpu/op  new cpu/op  delta
BM_eigen_inverse_float/1     2.72ns ± 0%  0.61ns ± 2%  -77.39%  (p=0.000 n=53+59)
BM_eigen_inverse_float/8     5.73ns ± 1%  6.25ns ± 2%   +9.10%  (p=0.000 n=58+60)
BM_eigen_inverse_float/64    31.2ns ± 2%  13.4ns ± 2%  -56.96%  (p=0.000 n=51+60)
BM_eigen_inverse_float/512    115ns ± 2%    43ns ± 3%  -62.38%  (p=0.000 n=59+57)
BM_eigen_inverse_float/4k     781ns ± 2%   290ns ± 2%  -62.88%  (p=0.000 n=60+53)
BM_eigen_inverse_float/32k   6.12µs ± 2%  2.94µs ± 3%  -51.99%  (p=0.000 n=59+48)
BM_eigen_inverse_float/256k  80.1µs ± 2%  81.2µs ± 2%   +1.28%  (p=0.000 n=60+56)
BM_eigen_inverse_float/1M     321µs ± 2%   324µs ± 1%   +0.91%  (p=0.000 n=33+29)


AVX (packet size 8):
name                         old cpu/op  new cpu/op  delta
BM_eigen_inverse_float/1     2.72ns ± 0%  3.28ns ± 1%  +20.53%  (p=0.000 n=54+45)
BM_eigen_inverse_float/8     5.72ns ± 0%  6.65ns ± 0%  +16.21%  (p=0.000 n=56+56)
BM_eigen_inverse_float/64    19.0ns ± 0%  12.6ns ± 2%  -33.75%  (p=0.000 n=58+48)
BM_eigen_inverse_float/512   95.1ns ± 0%  50.7ns ± 4%  -46.65%  (p=0.000 n=52+55)
BM_eigen_inverse_float/4k     704ns ± 0%   368ns ± 2%  -47.65%  (p=0.000 n=56+50)
BM_eigen_inverse_float/32k   5.57µs ± 0%  3.47µs ± 3%  -37.75%  (p=0.000 n=57+50)
BM_eigen_inverse_float/256k  78.2µs ± 1%  80.7µs ± 2%   +3.29%  (p=0.000 n=59+58)
BM_eigen_inverse_float/1M     313µs ± 1%   323µs ± 1%   +3.42%  (p=0.000 n=33+33)

SSE (packet size 4):
name                         old cpu/op  new cpu/op  delta
BM_eigen_inverse_float/1     0.83ns ± 1%  0.83ns ± 0%   -0.10%  (p=0.006 n=56+50)
BM_eigen_inverse_float/8     3.26ns ± 0%  2.45ns ± 0%  -24.68%  (p=0.000 n=48+50)
BM_eigen_inverse_float/64    14.7ns ± 0%  11.1ns ± 1%  -24.48%  (p=0.000 n=53+54)
BM_eigen_inverse_float/512    106ns ± 0%    86ns ± 0%  -18.47%  (p=0.000 n=55+55)
BM_eigen_inverse_float/4k     835ns ± 0%   640ns ± 0%  -23.44%  (p=0.000 n=55+54)
BM_eigen_inverse_float/32k   6.67µs ± 0%  5.81µs ± 0%  -12.83%  (p=0.000 n=51+56)
BM_eigen_inverse_float/256k  78.4µs ± 2%  79.0µs ± 1%   +0.71%  (p=0.000 n=55+54)
BM_eigen_inverse_float/1M     313µs ± 1%   316µs ± 1%   +0.88%  (p=0.000 n=29+30)

Thanks to @sandwichmaker for reviewing a preliminary version of this at Google.

Edited by Rasmus Munk Larsen
