SIMD sincos double
Reference issue
This is another step in the SIMD implementation campaign #2635. This MR adds a vectorized implementation of sine and cosine.
What does this implement/fix?
This is a draft of the sincos implementation in double precision. The argument reduction uses the Veltkamp method to split pi/2 into four values, and the polynomial approximation uses a Padé approximant whose order guarantees double precision over the whole [-pi/4, pi/4] domain.
To preserve accuracy for large arguments, the implementation falls back to std::sin and std::cos for inputs with magnitude larger than 1e14.
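To make the reduction step concrete, here is a minimal scalar sketch of the idea, not the packet-math code in this MR: a Veltkamp split of pi/2 into pieces with few significant bits each, followed by a Cody-Waite style subtraction of the nearest multiple of pi/2. The function names, the two-piece split, and the constants are illustrative only; the MR uses a four-piece split.

#include <cmath>

// Illustrative scalar sketch, NOT the MR's packet-math code.
// Veltkamp split: break a double into a high part with ~26 significant bits
// and an exact low remainder, so that subsequent products lose few bits.
static void veltkamp_split(double a, double& hi, double& lo) {
  const double c = 134217729.0;  // 2^27 + 1, the Dekker/Veltkamp constant
  double t = c * a;
  hi = t - (t - a);
  lo = a - hi;  // a == hi + lo exactly
}

// Cody-Waite style reduction of x into roughly [-pi/4, pi/4].
// Only a two-piece split of pi/2 is shown; the MR splits pi/2 into four
// pieces so the products n * piece stay accurate for larger n.
double reduce_to_quarter_pi(double x, int& quadrant) {
  const double half_pi     = 1.57079632679489661923;
  const double two_over_pi = 0.63661977236758134308;

  double p_hi, p_lo;
  veltkamp_split(half_pi, p_hi, p_lo);

  double n = std::nearbyint(x * two_over_pi);   // nearest multiple of pi/2
  double r = (x - n * p_hi) - n * p_lo;         // subtract n*pi/2 piece by piece

  quadrant = static_cast<int>(static_cast<long long>(n) & 3);  // picks sin/cos branch and sign
  return r;  // fed to the polynomial/Padé approximant on [-pi/4, pi/4]
}

In the actual implementation, the Padé approximant is evaluated on the reduced argument and the sin/cos choice and sign are selected from the quadrant; for |x| above the 1e14 threshold mentioned above, it falls back to std::sin/std::cos instead.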
Any comments or ideas are welcome!
Benchmark results
For benchmarking, I used this code
#include <benchmark/benchmark.h>
#include <Eigen/Dense>

void BM_vec_f_cos(benchmark::State& state) {
  Eigen::VectorXd x;
  x.setRandom(state.range(0));
  for (auto s : state) {
    x = x.array().cos();
  }
}
// Register the function as a benchmark
BENCHMARK(BM_vec_f_cos)->Range(1, 1 << 20);
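The float numbers below presumably come from the same harness with Eigen::VectorXf; the exact benchmark is not shown in this MR, so the following variant is just an assumption (it relies on the includes above):

// Hypothetical float counterpart of the benchmark above (assumed, not part of the MR).
void BM_vec_f_cos_float(benchmark::State& state) {
  Eigen::VectorXf x;
  x.setRandom(state.range(0));
  for (auto s : state) {
    x = x.array().cos();
  }
}
BENCHMARK(BM_vec_f_cos_float)->Range(1, 1 << 20);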
For reference, I measured the timing for the float version (already in the code) and the double version (implemented in this MR). Here are the results; each row is the Google Benchmark output (wall-clock time, CPU time, iteration count).
## DOUBLE
# Not vectorized
BM_vec_f_cos/1 13.7 ns 13.7 ns 53268784
BM_vec_f_cos/8 56.8 ns 56.7 ns 12128308
BM_vec_f_cos/64 444 ns 443 ns 1563514
BM_vec_f_cos/512 3553 ns 3545 ns 195555
BM_vec_f_cos/4096 28355 ns 28283 ns 24766
BM_vec_f_cos/32768 228040 ns 227619 ns 3046
BM_vec_f_cos/262144 1811000 ns 1807908 ns 371
BM_vec_f_cos/1048576 7370421 ns 7357243 ns 84
# SSE4.1
BM_vec_f_cos/1 14.5 ns 14.5 ns 50010744
BM_vec_f_cos/8 65.1 ns 64.9 ns 10544386
BM_vec_f_cos/64 515 ns 513 ns 1360944
BM_vec_f_cos/512 4049 ns 4038 ns 173452
BM_vec_f_cos/4096 32180 ns 32097 ns 21695
BM_vec_f_cos/32768 257746 ns 257044 ns 2714
BM_vec_f_cos/262144 2011084 ns 2005974 ns 336
BM_vec_f_cos/1048576 8049604 ns 8028964 ns 89
# AVX2
BM_vec_f_cos/1 15.0 ns 15.0 ns 37314589
BM_vec_f_cos/8 47.3 ns 47.1 ns 14208618
BM_vec_f_cos/64 361 ns 360 ns 1895361
BM_vec_f_cos/512 2897 ns 2889 ns 243382
BM_vec_f_cos/4096 22173 ns 22118 ns 31485
BM_vec_f_cos/32768 178418 ns 177959 ns 3932
BM_vec_f_cos/262144 1391234 ns 1387836 ns 487
BM_vec_f_cos/1048576 5677948 ns 5663670 ns 127
# AVX512
BM_vec_f_cos/1 14.0 ns 14.0 ns 49550612
BM_vec_f_cos/8 27.2 ns 27.1 ns 25774820
BM_vec_f_cos/64 101 ns 101 ns 6903473
BM_vec_f_cos/512 813 ns 812 ns 876030
BM_vec_f_cos/4096 6535 ns 6523 ns 109188
BM_vec_f_cos/32768 53296 ns 53206 ns 13487
BM_vec_f_cos/262144 440815 ns 439662 ns 1609
BM_vec_f_cos/1048576 1740265 ns 1735814 ns 391
## FLOAT
# Not vectorized
BM_vec_f_cos/1 10.3 ns 10.3 ns 62599112
BM_vec_f_cos/8 23.9 ns 23.8 ns 28803165
BM_vec_f_cos/64 191 ns 191 ns 3608446
BM_vec_f_cos/512 1500 ns 1495 ns 469985
BM_vec_f_cos/4096 11913 ns 11879 ns 59246
BM_vec_f_cos/32768 100809 ns 100460 ns 7526
BM_vec_f_cos/262144 891223 ns 887243 ns 709
BM_vec_f_cos/1048576 2912187 ns 2905854 ns 219
# AVX2
BM_vec_f_cos/1 11.3 ns 11.3 ns 61839107
BM_vec_f_cos/8 23.4 ns 23.3 ns 29502403
BM_vec_f_cos/64 90.1 ns 89.8 ns 8276673
BM_vec_f_cos/512 766 ns 764 ns 947448
BM_vec_f_cos/4096 5781 ns 5764 ns 120739
BM_vec_f_cos/32768 57926 ns 57683 ns 15030
BM_vec_f_cos/262144 326851 ns 325917 ns 2104
BM_vec_f_cos/1048576 1262296 ns 1258357 ns 536
# SSE4.1
BM_vec_f_cos/1 10.9 ns 10.8 ns 64594625
BM_vec_f_cos/8 22.7 ns 22.6 ns 31031063
BM_vec_f_cos/64 163 ns 162 ns 3969898
BM_vec_f_cos/512 1327 ns 1322 ns 476229
BM_vec_f_cos/4096 10240 ns 10204 ns 69161
BM_vec_f_cos/32768 82552 ns 82252 ns 8256
BM_vec_f_cos/262144 619372 ns 617440 ns 1025
BM_vec_f_cos/1048576 2422650 ns 2416255 ns 288
For the new double implementation, the SSE4.1 version barely keeps up with the scalar code on large vectors; the picture for float is similar, although slightly better. AVX2 delivers around a 1.5x speedup. There is clearly room for improvement.