Vectorize cbrt for float and double.
The implementation is accurate to 1 ULP for double and 2 ULPs for float.
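A bound like the one above can be checked by comparing each result against a higher-precision reference and expressing the difference in units of the float spacing at that value. The sketch below is one minimal way to do this for float, assuming `std::cbrt` evaluated in double is an adequate reference; `ulp_error` and `max_ulp_error` are hypothetical helper names and are not part of Eigen.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of Eigen): error of a float result, measured
// in ULPs of the float nearest to the double-precision reference value.
inline double ulp_error(float approx, double exact) {
  assert(std::isfinite(exact));
  const float ref = static_cast<float>(exact);
  // Spacing between |ref| and the next representable float above it.
  const float next =
      std::nextafter(std::fabs(ref), std::numeric_limits<float>::infinity());
  const double ulp = static_cast<double>(next) - static_cast<double>(std::fabs(ref));
  return std::fabs(static_cast<double>(approx) - exact) / ulp;
}

// Hypothetical helper: worst ULP error of a float cube root routine over
// evenly spaced samples in [lo, hi], using double std::cbrt as reference.
template <typename F>
double max_ulp_error(F cbrt_under_test, float lo, float hi, int steps) {
  assert(steps > 0);
  double worst = 0.0;
  for (int i = 0; i <= steps; ++i) {
    const float x = lo + (hi - lo) * (static_cast<float>(i) / static_cast<float>(steps));
    const double exact = std::cbrt(static_cast<double>(x));
    worst = std::max(worst, ulp_error(cbrt_under_test(x), exact));
  }
  return worst;
}
```

Passing a lambda that wraps the routine under test to `max_ulp_error` returns the largest observed error over the sampled range. Evenly spaced samples are only a spot check; an exhaustive sweep over all float bit patterns would be needed to confirm a bound.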
This change also adds (non-vectorized) numext::cbrt for complex types.
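For illustration, one common way to define a scalar complex cube root is through the polar form: take the real cube root of the magnitude and divide the argument by three. The sketch below follows that approach; it is an assumption about how such a function could be written, not a description of what numext::cbrt does for complex types, and `complex_cbrt_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical sketch, not Eigen's implementation: principal cube root of a
// finite complex number, computed from its polar representation.
template <typename T>
std::complex<T> complex_cbrt_sketch(const std::complex<T>& z) {
  assert(std::isfinite(z.real()) && std::isfinite(z.imag()));
  if (z == std::complex<T>(T(0), T(0))) return z;  // cbrt(0) = 0
  const T r = std::cbrt(std::abs(z));      // cube root of the magnitude
  const T theta = std::arg(z) / T(3);      // one third of the argument
  return std::polar(r, theta);
}
```

With this definition the result is the principal root, so the cube root of the complex number -8 + 0i is 1 + i√3 rather than -2, which differs from the real-valued cbrt(-8) = -2.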
Benchmark measurements for AVX2:
name                       old cpu/op   new cpu/op   delta
BM_eigen_cbrt_double/1     3.00ns ± 1%  2.46ns ± 0%  -18.15%  (p=0.000 n=50+52)
BM_eigen_cbrt_double/8     143ns ± 0%   66ns ± 1%    -53.96%  (p=0.000 n=54+49)
BM_eigen_cbrt_double/64    1.28µs ± 0%  0.32µs ± 8%  -74.92%  (p=0.000 n=56+44)
BM_eigen_cbrt_double/512   10.4µs ± 0%  2.3µs ± 7%   -77.45%  (p=0.000 n=59+45)
BM_eigen_cbrt_double/4k    83.5µs ± 0%  19.0µs ±11%  -77.22%  (p=0.000 n=56+59)
BM_eigen_cbrt_double/32k   669µs ± 0%   148µs ± 2%   -77.82%  (p=0.000 n=58+48)
BM_eigen_cbrt_double/256k  5.39ms ± 0%  1.18ms ± 3%  -78.05%  (p=0.000 n=58+48)
BM_eigen_cbrt_double/1M    21.6ms ± 1%  4.8ms ± 4%   -78.00%  (p=0.000 n=52+50)
BM_eigen_cbrt_float/1      3.00ns ± 1%  1.91ns ± 0%  -36.26%  (p=0.000 n=51+55)
BM_eigen_cbrt_float/8      48.1ns ± 1%  47.2ns ± 0%  -1.88%   (p=0.000 n=51+53)
BM_eigen_cbrt_float/64     423ns ± 1%   151ns ± 4%   -64.38%  (p=0.000 n=52+60)
BM_eigen_cbrt_float/512    3.38µs ± 0%  0.90µs ± 6%  -73.33%  (p=0.000 n=44+59)
BM_eigen_cbrt_float/4k     27.1µs ± 0%  7.0µs ± 9%   -74.15%  (p=0.000 n=45+59)
BM_eigen_cbrt_float/32k    217µs ± 1%   55µs ± 5%    -74.42%  (p=0.000 n=52+53)
BM_eigen_cbrt_float/256k   1.73ms ± 0%  0.44ms ± 6%  -74.43%  (p=0.000 n=53+57)
BM_eigen_cbrt_float/1M     6.94ms ± 1%  1.77ms ± 6%  -74.42%  (p=0.000 n=55+57)
Edited by Rasmus Munk Larsen