Add more missing vectorized casts for int on x86, and remove redundant unit tests
The removed unit tests were redundant and, worse, some of them invoked undefined behavior.
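For readers unfamiliar with what a vectorized double-to-int cast looks like on x86, here is a minimal sketch using SSE2 intrinsics. It is not the code from this change: `cast_double_to_int` is a hypothetical helper written only for illustration, and it does not use Eigen's packet API. It converts two packets of 2 doubles into one packet of 4 ints with truncation toward zero, and falls back to a scalar loop for the tail (or when SSE2 is unavailable). Inputs are assumed to be representable as `int`; in C++, a scalar `static_cast<int>` of an out-of-range floating-point value is undefined behavior, which is one way a cast test can go wrong.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (not part of Eigen): converts n doubles to ints,
// truncating toward zero. Assumes every input value fits in an int.
void cast_double_to_int(const double* in, int* out, std::size_t n) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  std::size_t i = 0;
#if defined(__SSE2__)
  // Process 4 elements per iteration: two 2-double packets -> one 4-int packet.
  for (; i + 4 <= n; i += 4) {
    __m128d a = _mm_loadu_pd(in + i);
    __m128d b = _mm_loadu_pd(in + i + 2);
    // Each conversion yields 2 ints in the low 64 bits; the high 64 bits are zero.
    __m128i lo = _mm_cvttpd_epi32(a);
    __m128i hi = _mm_cvttpd_epi32(b);
    // Concatenate the two low halves into a single 4-int packet.
    __m128i packed = _mm_unpacklo_epi64(lo, hi);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), packed);
  }
#endif
  // Scalar tail (and full fallback when SSE2 is not available).
  for (; i < n; ++i) {
    out[i] = static_cast<int>(in[i]);
  }
}
```

The wider instruction sets follow the same pattern with more lanes per conversion, which is consistent with the largest gains below appearing in the AVX512F runs.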
Benchmark measurements for affected operations:
Measured on Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
SSE:
name old cpu/op new cpu/op delta
BM_cast<double,int>/8 14.0ns ± 0% 10.8ns ± 3% -22.97% (p=0.000 n=16+20)
BM_cast<double,int>/64 21.4ns ± 0% 18.7ns ± 0% -12.55% (p=0.000 n=14+19)
BM_cast<double,int>/512 113ns ± 0% 108ns ± 0% -3.81% (p=0.000 n=16+19)
BM_cast<double,int>/4k 854ns ± 0% 851ns ± 0% -0.41% (p=0.000 n=18+18)
BM_cast<double,int>/32k 6.70µs ± 0% 6.70µs ± 0% ~ (p=0.751 n=19+19)
BM_cast<double,int>/256k 117µs ± 1% 118µs ± 1% +0.59% (p=0.001 n=20+19)
AVX:
name old cpu/op new cpu/op delta
BM_cast<float,int>/8 13.2ns ± 0% 11.9ns ± 0% -10.13% (p=0.000 n=56+59)
BM_cast<float,int>/64 15.7ns ± 1% 13.4ns ± 0% -14.66% (p=0.000 n=43+58)
BM_cast<float,int>/512 32.6ns ± 2% 31.7ns ±19% -2.75% (p=0.000 n=50+54)
BM_cast<float,int>/4k 203ns ±11% 222ns ± 6% +9.03% (p=0.000 n=60+60)
BM_cast<float,int>/32k 2.65µs ± 2% 2.65µs ± 3% ~ (p=0.150 n=53+53)
BM_cast<float,int>/256k 79.9µs ± 2% 79.8µs ± 1% -0.15% (p=0.040 n=57+60)
BM_cast<int,float>/8 13.2ns ± 0% 11.9ns ± 0% -10.16% (p=0.000 n=55+59)
BM_cast<int,float>/64 15.7ns ± 1% 13.4ns ± 1% -14.64% (p=0.000 n=41+59)
BM_cast<int,float>/512 32.6ns ± 2% 29.8ns ± 2% -8.36% (p=0.000 n=45+52)
BM_cast<int,float>/4k 203ns ±11% 221ns ± 6% +8.87% (p=0.000 n=60+59)
BM_cast<int,float>/32k 2.65µs ± 2% 2.64µs ± 3% ~ (p=0.658 n=55+55)
BM_cast<int,float>/256k 79.8µs ± 2% 79.8µs ± 1% ~ (p=0.650 n=60+59)
BM_cast<double,int>/8 14.0ns ± 1% 10.4ns ± 0% -25.50% (p=0.000 n=25+23)
BM_cast<double,int>/64 18.2ns ± 0% 15.9ns ± 0% -12.76% (p=0.000 n=23+25)
BM_cast<double,int>/512 56.6ns ± 3% 54.7ns ±11% -3.21% (p=0.006 n=25+25)
BM_cast<double,int>/4k 582ns ± 2% 581ns ± 2% ~ (p=0.617 n=24+24)
BM_cast<double,int>/32k 4.57µs ± 3% 4.58µs ± 2% ~ (p=0.466 n=25+22)
BM_cast<double,int>/256k 120µs ± 2% 120µs ± 2% ~ (p=0.476 n=25+25)
AVX2:
name old cpu/op new cpu/op delta
BM_cast<float,int>/8 13.3ns ± 1% 11.9ns ± 0% -10.49% (p=0.000 n=58+59)
BM_cast<float,int>/64 15.7ns ± 0% 13.4ns ± 0% -14.66% (p=0.000 n=43+56)
BM_cast<float,int>/512 32.8ns ± 2% 30.2ns ± 3% -8.13% (p=0.000 n=49+52)
BM_cast<float,int>/4k 216ns ± 8% 228ns ±10% +5.98% (p=0.000 n=49+48)
BM_cast<float,int>/32k 2.67µs ± 3% 2.66µs ± 2% ~ (p=0.057 n=54+53)
BM_cast<float,int>/256k 79.7µs ± 3% 80.2µs ± 1% +0.67% (p=0.004 n=60+60)
BM_cast<int,float>/8 13.2ns ± 0% 11.9ns ± 0% -10.11% (p=0.000 n=53+59)
BM_cast<int,float>/64 15.7ns ± 0% 13.4ns ± 0% -14.72% (p=0.000 n=47+55)
BM_cast<int,float>/512 32.8ns ± 2% 30.1ns ± 2% -8.08% (p=0.000 n=49+55)
BM_cast<int,float>/4k 210ns ±10% 223ns ±12% +6.23% (p=0.000 n=59+58)
BM_cast<int,float>/32k 2.68µs ± 2% 2.66µs ± 3% -0.70% (p=0.004 n=55+55)
BM_cast<int,float>/256k 79.7µs ± 2% 80.2µs ± 1% +0.63% (p=0.006 n=59+59)
BM_cast<double,int>/8 14.0ns ± 0% 10.4ns ± 0% -25.90% (p=0.000 n=20+22)
BM_cast<double,int>/64 18.2ns ± 0% 15.9ns ± 0% -12.81% (p=0.000 n=23+18)
BM_cast<double,int>/512 56.8ns ± 1% 53.8ns ± 1% -5.31% (p=0.000 n=20+24)
BM_cast<double,int>/4k 586ns ± 1% 588ns ± 3% ~ (p=0.486 n=23+23)
BM_cast<double,int>/32k 4.62µs ± 3% 4.66µs ± 1% +0.89% (p=0.006 n=24+23)
BM_cast<double,int>/256k 121µs ± 1% 121µs ± 1% ~ (p=0.358 n=23+24)
AVX512F:
name old cpu/op new cpu/op delta
BM_cast<float,int>/8 18.5ns ± 0% 14.0ns ± 0% -24.31% (p=0.000 n=56+59)
BM_cast<float,int>/64 19.4ns ± 3% 14.0ns ± 3% -27.81% (p=0.000 n=59+60)
BM_cast<float,int>/512 64.2ns ±10% 21.3ns ± 5% -66.77% (p=0.000 n=58+59)
BM_cast<float,int>/4k 405ns ± 3% 108ns ± 7% -73.40% (p=0.000 n=50+60)
BM_cast<float,int>/32k 4.95µs ± 3% 2.69µs ± 3% -45.55% (p=0.000 n=57+53)
BM_cast<float,int>/256k 92.4µs ± 5% 81.0µs ± 2% -12.35% (p=0.000 n=60+60)
BM_cast<int,float>/8 18.3ns ± 1% 13.8ns ± 1% -24.69% (p=0.000 n=57+60)
BM_cast<int,float>/64 19.4ns ± 3% 13.9ns ± 3% -28.14% (p=0.000 n=57+60)
BM_cast<int,float>/512 65.1ns ± 7% 21.3ns ± 5% -67.29% (p=0.000 n=55+59)
BM_cast<int,float>/4k 414ns ± 3% 108ns ± 7% -73.98% (p=0.000 n=52+60)
BM_cast<int,float>/32k 4.94µs ± 3% 2.69µs ± 4% -45.49% (p=0.000 n=57+54)
BM_cast<int,float>/256k 91.2µs ± 5% 81.0µs ± 2% -11.16% (p=0.000 n=60+59)
BM_cast<double,int>/8 19.9ns ± 2% 13.0ns ± 0% -34.89% (p=0.000 n=21+23)
BM_cast<double,int>/64 20.5ns ± 3% 15.2ns ± 4% -25.76% (p=0.000 n=23+25)
BM_cast<double,int>/512 72.8ns ±14% 37.9ns ± 6% -47.97% (p=0.000 n=24+20)
BM_cast<double,int>/4k 844ns ± 5% 477ns ± 4% -43.44% (p=0.000 n=25+23)
BM_cast<double,int>/32k 6.82µs ± 5% 3.82µs ± 3% -43.95% (p=0.000 n=25+20)
BM_cast<double,int>/256k 167µs ± 4% 122µs ± 2% -26.94% (p=0.000 n=25+25)
Edited by Rasmus Munk Larsen