Optimize float->bool cast for AVX2, based on Charles Schlosser's comments.

Thanks for the catch, @chuckyschluz. This nets a nice little speedup for pcast<Packet8f,Packet16b> with AVX2 enabled.

Both versions are compared against the same baseline from before !1272 (merged):

Before this change:
name                      old cpu/op   new cpu/op   delta
BM_cast<float,bool>/8        18.3ns ± 0%  17.8ns ± 0%   -2.44%  (p=0.000 n=46+48)
BM_cast<float,bool>/64       17.8ns ± 2%  18.2ns ± 0%   +2.61%  (p=0.000 n=55+48)
BM_cast<float,bool>/512      66.6ns ± 8%  52.0ns ± 9%  -21.84%  (p=0.000 n=53+60)
BM_cast<float,bool>/4k        466ns ± 1%   322ns ± 3%  -30.94%  (p=0.000 n=53+51)
BM_cast<float,bool>/32k      4.18µs ± 4%  2.71µs ± 5%  -35.20%  (p=0.000 n=49+55)
BM_cast<float,bool>/256k     44.8µs ± 6%  39.6µs ± 6%  -11.63%  (p=0.000 n=58+49)
BM_cast<float,bool>/1M        204µs ± 3%   200µs ± 2%   -1.71%  (p=0.000 n=60+59)

After this change:
name                      old cpu/op   new cpu/op   delta
BM_cast<float,bool>/8     18.3ns ± 0%  17.9ns ± 0%   -2.03%  (p=0.000 n=45+53)
BM_cast<float,bool>/64    17.6ns ± 1%  16.8ns ± 0%   -4.71%  (p=0.000 n=55+45)
BM_cast<float,bool>/512   65.5ns ± 1%  46.1ns ± 5%  -29.61%  (p=0.000 n=51+59)
BM_cast<float,bool>/4k     470ns ± 2%   260ns ± 3%  -44.53%  (p=0.000 n=47+53)
BM_cast<float,bool>/32k   4.21µs ± 6%  2.42µs ± 4%  -42.49%  (p=0.000 n=55+51)
BM_cast<float,bool>/256k  40.2µs ±27%  32.3µs ±52%  -19.69%  (p=0.000 n=60+50)
BM_cast<float,bool>/1M     190µs ±21%   196µs ± 6%     ~     (p=0.885 n=59+52)
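
For readers unfamiliar with what a packed float->bool cast involves, here is a minimal standalone sketch. It is not the code from this merge request and not Eigen's pcast implementation. The helper name cast16_float_to_bool is hypothetical, and the compare/pack/permute sequence is just one straightforward way to do it with AVX2 intrinsics. It takes 16 floats (two 8-float registers, like the two Packet8f inputs) and writes 16 bytes that are 1 where the input is nonzero and 0 otherwise. Without AVX2 it falls back to a scalar loop with the same semantics.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (illustration only, not Eigen's pcast).
// Converts 16 floats to 16 bytes: 1 if the float is nonzero (NaN counts
// as nonzero, -0.0f counts as zero), 0 otherwise.
inline void cast16_float_to_bool(const float* in, std::uint8_t* out) {
#if defined(__AVX2__)
  const __m256 zero = _mm256_setzero_ps();
  // Each 32-bit lane becomes all-ones where in != 0. The unordered
  // predicate makes NaN compare as "not equal", matching scalar C++.
  const __m256i lo = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_loadu_ps(in), zero, _CMP_NEQ_UQ));
  const __m256i hi = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_loadu_ps(in + 8), zero, _CMP_NEQ_UQ));
  // Narrow 32-bit masks to 16 bits. The pack works per 128-bit lane, so
  // the 64-bit quarters come out as lo[0..3], hi[0..3], lo[4..7], hi[4..7].
  __m256i packed16 = _mm256_packs_epi32(lo, hi);
  // Reorder the quarters to lo[0..3], lo[4..7], hi[0..3], hi[4..7].
  packed16 = _mm256_permute4x64_epi64(packed16, _MM_SHUFFLE(3, 1, 2, 0));
  // Narrow 16-bit masks to 8 bits (0xFF or 0x00 per element).
  __m128i bytes = _mm_packs_epi16(_mm256_castsi256_si128(packed16),
                                  _mm256_extracti128_si256(packed16, 1));
  // Turn 0xFF into 1 so each byte is a valid bool value.
  bytes = _mm_and_si128(bytes, _mm_set1_epi8(1));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), bytes);
#else
  // Scalar fallback with identical results.
  for (std::size_t i = 0; i < 16; ++i) {
    out[i] = (in[i] != 0.0f) ? 1 : 0;
  }
#endif
}
```

Build with -mavx2 (GCC/Clang) to exercise the vector path. The cross-lane permute is needed because the 256-bit pack instructions operate on each 128-bit half independently.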
Edited by Rasmus Munk Larsen
