
Add more missing vectorized casts for int on x86, and remove redundant unit tests

The removed unit tests were redundant and, worse, some of them invoked undefined behavior.
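
The description does not say which undefined behavior the removed tests triggered. One common source in cast tests is converting a floating-point value to `int` when the truncated value does not fit (or the input is NaN), which C++ leaves undefined. The sketch below shows a guard for that case, assuming this is the kind of issue involved; `cast_to_int_is_defined` and `checked_cast_to_int` are hypothetical helpers for illustration, not part of the library, and the bounds assume a 32-bit `int`.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not library API): returns true when static_cast<int>(x)
// is well defined, i.e. x is not NaN and its truncated value fits in int.
// Assumes a 32-bit int, so both bounds are exactly representable as double.
inline bool cast_to_int_is_defined(double x) {
  if (std::isnan(x)) return false;
  const double lo = static_cast<double>(std::numeric_limits<int>::min()) - 1.0;
  const double hi = static_cast<double>(std::numeric_limits<int>::max()) + 1.0;
  // Truncation is toward zero, so any x strictly between lo and hi is safe.
  return x > lo && x < hi;
}

// Hypothetical helper: performs the cast only when it is well defined,
// otherwise returns the caller-supplied fallback.
inline int checked_cast_to_int(double x, int fallback) {
  return cast_to_int_is_defined(x) ? static_cast<int>(x) : fallback;
}
```

A test that feeds arbitrary random doubles straight into `static_cast<int>` has no such guard, so its expected result is not something the language guarantees.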

Benchmark measurements for affected operations:

Measured on Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

SSE
name                       old cpu/op   new cpu/op   delta
BM_cast<double,int>/8      14.0ns ± 0%  10.8ns ± 3%  -22.97%  (p=0.000 n=16+20)
BM_cast<double,int>/64     21.4ns ± 0%  18.7ns ± 0%  -12.55%  (p=0.000 n=14+19)
BM_cast<double,int>/512     113ns ± 0%   108ns ± 0%   -3.81%  (p=0.000 n=16+19)
BM_cast<double,int>/4k      854ns ± 0%   851ns ± 0%   -0.41%  (p=0.000 n=18+18)
BM_cast<double,int>/32k    6.70µs ± 0%  6.70µs ± 0%     ~     (p=0.751 n=19+19)
BM_cast<double,int>/256k    117µs ± 1%   118µs ± 1%   +0.59%  (p=0.001 n=20+19)

AVX
name                       old cpu/op   new cpu/op   delta
BM_cast<float,int>/8       13.2ns ± 0%  11.9ns ± 0%  -10.13%  (p=0.000 n=56+59)
BM_cast<float,int>/64      15.7ns ± 1%  13.4ns ± 0%  -14.66%  (p=0.000 n=43+58)
BM_cast<float,int>/512     32.6ns ± 2%  31.7ns ±19%   -2.75%  (p=0.000 n=50+54)
BM_cast<float,int>/4k       203ns ±11%   222ns ± 6%   +9.03%  (p=0.000 n=60+60)
BM_cast<float,int>/32k     2.65µs ± 2%  2.65µs ± 3%     ~     (p=0.150 n=53+53)
BM_cast<float,int>/256k    79.9µs ± 2%  79.8µs ± 1%   -0.15%  (p=0.040 n=57+60)
BM_cast<int,float>/8       13.2ns ± 0%  11.9ns ± 0%  -10.16%  (p=0.000 n=55+59)
BM_cast<int,float>/64      15.7ns ± 1%  13.4ns ± 1%  -14.64%  (p=0.000 n=41+59)
BM_cast<int,float>/512     32.6ns ± 2%  29.8ns ± 2%   -8.36%  (p=0.000 n=45+52)
BM_cast<int,float>/4k       203ns ±11%   221ns ± 6%   +8.87%  (p=0.000 n=60+59)
BM_cast<int,float>/32k     2.65µs ± 2%  2.64µs ± 3%     ~     (p=0.658 n=55+55)
BM_cast<int,float>/256k    79.8µs ± 2%  79.8µs ± 1%     ~     (p=0.650 n=60+59)
BM_cast<double,int>/8      14.0ns ± 1%  10.4ns ± 0%  -25.50%  (p=0.000 n=25+23)
BM_cast<double,int>/64     18.2ns ± 0%  15.9ns ± 0%  -12.76%  (p=0.000 n=23+25)
BM_cast<double,int>/512    56.6ns ± 3%  54.7ns ±11%   -3.21%  (p=0.006 n=25+25)
BM_cast<double,int>/4k      582ns ± 2%   581ns ± 2%     ~     (p=0.617 n=24+24)
BM_cast<double,int>/32k    4.57µs ± 3%  4.58µs ± 2%     ~     (p=0.466 n=25+22)
BM_cast<double,int>/256k    120µs ± 2%   120µs ± 2%     ~     (p=0.476 n=25+25)

AVX2
name                       old cpu/op   new cpu/op   delta
BM_cast<float,int>/8       13.3ns ± 1%  11.9ns ± 0%  -10.49%  (p=0.000 n=58+59)
BM_cast<float,int>/64      15.7ns ± 0%  13.4ns ± 0%  -14.66%  (p=0.000 n=43+56)
BM_cast<float,int>/512     32.8ns ± 2%  30.2ns ± 3%   -8.13%  (p=0.000 n=49+52)
BM_cast<float,int>/4k       216ns ± 8%   228ns ±10%   +5.98%  (p=0.000 n=49+48)
BM_cast<float,int>/32k     2.67µs ± 3%  2.66µs ± 2%     ~     (p=0.057 n=54+53)
BM_cast<float,int>/256k    79.7µs ± 3%  80.2µs ± 1%   +0.67%  (p=0.004 n=60+60)
BM_cast<int,float>/8       13.2ns ± 0%  11.9ns ± 0%  -10.11%  (p=0.000 n=53+59)
BM_cast<int,float>/64      15.7ns ± 0%  13.4ns ± 0%  -14.72%  (p=0.000 n=47+55)
BM_cast<int,float>/512     32.8ns ± 2%  30.1ns ± 2%   -8.08%  (p=0.000 n=49+55)
BM_cast<int,float>/4k       210ns ±10%   223ns ±12%   +6.23%  (p=0.000 n=59+58)
BM_cast<int,float>/32k     2.68µs ± 2%  2.66µs ± 3%   -0.70%  (p=0.004 n=55+55)
BM_cast<int,float>/256k    79.7µs ± 2%  80.2µs ± 1%   +0.63%  (p=0.006 n=59+59)
BM_cast<double,int>/8      14.0ns ± 0%  10.4ns ± 0%  -25.90%  (p=0.000 n=20+22)
BM_cast<double,int>/64     18.2ns ± 0%  15.9ns ± 0%  -12.81%  (p=0.000 n=23+18)
BM_cast<double,int>/512    56.8ns ± 1%  53.8ns ± 1%   -5.31%  (p=0.000 n=20+24)
BM_cast<double,int>/4k      586ns ± 1%   588ns ± 3%     ~     (p=0.486 n=23+23)
BM_cast<double,int>/32k    4.62µs ± 3%  4.66µs ± 1%   +0.89%  (p=0.006 n=24+23)
BM_cast<double,int>/256k    121µs ± 1%   121µs ± 1%     ~     (p=0.358 n=23+24)


AVX512F
name                       old cpu/op   new cpu/op   delta
BM_cast<float,int>/8       18.5ns ± 0%  14.0ns ± 0%  -24.31%  (p=0.000 n=56+59)
BM_cast<float,int>/64      19.4ns ± 3%  14.0ns ± 3%  -27.81%  (p=0.000 n=59+60)
BM_cast<float,int>/512     64.2ns ±10%  21.3ns ± 5%  -66.77%  (p=0.000 n=58+59)
BM_cast<float,int>/4k       405ns ± 3%   108ns ± 7%  -73.40%  (p=0.000 n=50+60)
BM_cast<float,int>/32k     4.95µs ± 3%  2.69µs ± 3%  -45.55%  (p=0.000 n=57+53)
BM_cast<float,int>/256k    92.4µs ± 5%  81.0µs ± 2%  -12.35%  (p=0.000 n=60+60)
BM_cast<int,float>/8       18.3ns ± 1%  13.8ns ± 1%  -24.69%  (p=0.000 n=57+60)
BM_cast<int,float>/64      19.4ns ± 3%  13.9ns ± 3%  -28.14%  (p=0.000 n=57+60)
BM_cast<int,float>/512     65.1ns ± 7%  21.3ns ± 5%  -67.29%  (p=0.000 n=55+59)
BM_cast<int,float>/4k       414ns ± 3%   108ns ± 7%  -73.98%  (p=0.000 n=52+60)
BM_cast<int,float>/32k     4.94µs ± 3%  2.69µs ± 4%  -45.49%  (p=0.000 n=57+54)
BM_cast<int,float>/256k    91.2µs ± 5%  81.0µs ± 2%  -11.16%  (p=0.000 n=60+59)
BM_cast<double,int>/8      19.9ns ± 2%  13.0ns ± 0%  -34.89%  (p=0.000 n=21+23)
BM_cast<double,int>/64     20.5ns ± 3%  15.2ns ± 4%  -25.76%  (p=0.000 n=23+25)
BM_cast<double,int>/512    72.8ns ±14%  37.9ns ± 6%  -47.97%  (p=0.000 n=24+20)
BM_cast<double,int>/4k      844ns ± 5%   477ns ± 4%  -43.44%  (p=0.000 n=25+23)
BM_cast<double,int>/32k    6.82µs ± 5%  3.82µs ± 3%  -43.95%  (p=0.000 n=25+20)
BM_cast<double,int>/256k    167µs ± 4%   122µs ± 2%  -26.94%  (p=0.000 n=25+25)
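
For context, a vectorized `double` to `int` cast of the kind benchmarked above can be sketched with SSE2 intrinsics. This is a minimal illustration, not the code from this merge request: `cast_double_to_int` is a hypothetical free function, inputs are assumed to be finite and within `int` range, and a scalar loop is used when SSE2 is unavailable and for the tail.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical function for illustration (not the library's API).
// Converts n doubles to ints, truncating toward zero.
// Assumes every input is finite and fits in int.
inline void cast_double_to_int(const double* in, int* out, std::size_t n) {
  std::size_t i = 0;
#if defined(__SSE2__)
  // Process four doubles per iteration: two 128-bit loads, two truncating
  // conversions, then pack the four 32-bit results into one 128-bit store.
  for (; i + 4 <= n; i += 4) {
    // Each conversion yields two ints in the low 64 bits; the high 64 bits are zero.
    __m128i lo = _mm_cvttpd_epi32(_mm_loadu_pd(in + i));
    __m128i hi = _mm_cvttpd_epi32(_mm_loadu_pd(in + i + 2));
    // Combine the low 64 bits of lo and hi into a single vector of four ints.
    __m128i packed = _mm_unpacklo_epi64(lo, hi);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), packed);
  }
#endif
  // Scalar fallback and tail handling.
  for (; i < n; ++i) out[i] = static_cast<int>(in[i]);
}
```

Packing several narrow results per store is the main saving over a scalar loop, which fits the pattern in the tables where the largest gains appear at small and medium sizes and shrink once the working set no longer fits in cache.
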
Edited by Rasmus Munk Larsen
