Vectorize cbrt for float and double.

The implementation is accurate to 1 ULP for double and 2 ULPs for float.
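
As a rough illustration of what an ULP bound means, the sketch below measures the error of a float result in units of the float spacing at a double-precision reference. The helper names `ulp_error_float` and `max_cbrt_ulp_error` are hypothetical and are not part of Eigen or of this change's tests. It uses `std::cbrt` as a stand-in for the function under test.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of Eigen): error of a float result, measured
// in units of the float spacing (ULP) at the rounded reference value.
inline double ulp_error_float(float approx, double reference) {
  assert(std::isfinite(reference) && reference != 0.0);
  const float ref_f = static_cast<float>(reference);
  // Neighbouring float in the direction away from zero.
  const float away = std::nextafter(
      ref_f, std::copysign(std::numeric_limits<float>::infinity(), ref_f));
  const double ulp =
      std::fabs(static_cast<double>(away) - static_cast<double>(ref_f));
  return std::fabs(static_cast<double>(approx) - reference) / ulp;
}

// Largest error seen on a uniform grid in [lo, hi), with lo > 0 assumed.
// std::cbrt(float) stands in for the implementation under test, and
// std::cbrt(double) serves as the higher-precision reference.
inline double max_cbrt_ulp_error(float lo, float hi, int samples) {
  double worst = 0.0;
  for (int i = 0; i < samples; ++i) {
    const float x = lo + (hi - lo) * (static_cast<float>(i) / samples);
    const double err =
        ulp_error_float(std::cbrt(x), std::cbrt(static_cast<double>(x)));
    if (err > worst) worst = err;
  }
  return worst;
}
```

To check a vectorized kernel against the stated bound, one would substitute it for `std::cbrt(x)` in the loop and compare the returned maximum against 2 (float) or, with a double variant and an extended-precision reference, against 1.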

This change also adds (non-vectorized) numext::cbrt for complex types.
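
The description does not say how the complex overload is computed. For orientation only, here is a minimal sketch of one common definition, the principal cube root taken in polar form. The function name `complex_cbrt_sketch` is hypothetical and this is not presented as Eigen's implementation.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper (not Eigen's numext::cbrt): principal cube root of z,
// computed as cbrt(|z|) * exp(i * arg(z) / 3).
template <typename T>
std::complex<T> complex_cbrt_sketch(const std::complex<T>& z) {
  const T magnitude = std::cbrt(std::abs(z));  // real cube root of the modulus
  const T angle = std::arg(z) / T(3);          // one third of the argument
  return std::polar(magnitude, angle);
}
```

Under this definition, a negative real input such as -8 maps to a root with a positive imaginary part (approximately 1 + 1.732i) rather than to the real value -2 that the real-valued cbrt returns.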

Benchmark measurements for AVX2:

name                       old cpu/op   new cpu/op   delta
BM_eigen_cbrt_double/1     3.00ns ± 1%  2.46ns ± 0%  -18.15%  (p=0.000 n=50+52)
BM_eigen_cbrt_double/8      143ns ± 0%    66ns ± 1%  -53.96%  (p=0.000 n=54+49)
BM_eigen_cbrt_double/64    1.28µs ± 0%  0.32µs ± 8%  -74.92%  (p=0.000 n=56+44)
BM_eigen_cbrt_double/512   10.4µs ± 0%   2.3µs ± 7%  -77.45%  (p=0.000 n=59+45)
BM_eigen_cbrt_double/4k    83.5µs ± 0%  19.0µs ±11%  -77.22%  (p=0.000 n=56+59)
BM_eigen_cbrt_double/32k    669µs ± 0%   148µs ± 2%  -77.82%  (p=0.000 n=58+48)
BM_eigen_cbrt_double/256k  5.39ms ± 0%  1.18ms ± 3%  -78.05%  (p=0.000 n=58+48)
BM_eigen_cbrt_double/1M    21.6ms ± 1%   4.8ms ± 4%  -78.00%  (p=0.000 n=52+50)
BM_eigen_cbrt_float/1      3.00ns ± 1%  1.91ns ± 0%  -36.26%  (p=0.000 n=51+55)
BM_eigen_cbrt_float/8      48.1ns ± 1%  47.2ns ± 0%   -1.88%  (p=0.000 n=51+53)
BM_eigen_cbrt_float/64      423ns ± 1%   151ns ± 4%  -64.38%  (p=0.000 n=52+60)
BM_eigen_cbrt_float/512    3.38µs ± 0%  0.90µs ± 6%  -73.33%  (p=0.000 n=44+59)
BM_eigen_cbrt_float/4k     27.1µs ± 0%   7.0µs ± 9%  -74.15%  (p=0.000 n=45+59)
BM_eigen_cbrt_float/32k     217µs ± 1%    55µs ± 5%  -74.42%  (p=0.000 n=52+53)
BM_eigen_cbrt_float/256k   1.73ms ± 0%  0.44ms ± 6%  -74.43%  (p=0.000 n=53+57)
BM_eigen_cbrt_float/1M     6.94ms ± 1%  1.77ms ± 6%  -74.42%  (p=0.000 n=55+57)
Edited by Rasmus Munk Larsen
