Optimize float->bool cast for AVX2, based on Charles Schlosser's comments.
Thanks for the catch, @chuckyschluz. This nets a nice little speedup for pcast<Packet8f,Packet16b>
with AVX2 enabled:
Both versions are compared against the same baseline, taken prior to !1272 (merged):
Before:
name                      old cpu/op   new cpu/op   delta
BM_cast<float,bool>/8     18.3ns ± 0%  17.8ns ± 0%   -2.44%  (p=0.000 n=46+48)
BM_cast<float,bool>/64    17.8ns ± 2%  18.2ns ± 0%   +2.61%  (p=0.000 n=55+48)
BM_cast<float,bool>/512   66.6ns ± 8%  52.0ns ± 9%  -21.84%  (p=0.000 n=53+60)
BM_cast<float,bool>/4k     466ns ± 1%   322ns ± 3%  -30.94%  (p=0.000 n=53+51)
BM_cast<float,bool>/32k   4.18µs ± 4%  2.71µs ± 5%  -35.20%  (p=0.000 n=49+55)
BM_cast<float,bool>/256k  44.8µs ± 6%  39.6µs ± 6%  -11.63%  (p=0.000 n=58+49)
BM_cast<float,bool>/1M     204µs ± 3%   200µs ± 2%   -1.71%  (p=0.000 n=60+59)
After:
name                      old cpu/op   new cpu/op   delta
BM_cast<float,bool>/8     18.3ns ± 0%  17.9ns ± 0%   -2.03%  (p=0.000 n=45+53)
BM_cast<float,bool>/64    17.6ns ± 1%  16.8ns ± 0%   -4.71%  (p=0.000 n=55+45)
BM_cast<float,bool>/512   65.5ns ± 1%  46.1ns ± 5%  -29.61%  (p=0.000 n=51+59)
BM_cast<float,bool>/4k     470ns ± 2%   260ns ± 3%  -44.53%  (p=0.000 n=47+53)
BM_cast<float,bool>/32k   4.21µs ± 6%  2.42µs ± 4%  -42.49%  (p=0.000 n=55+51)
BM_cast<float,bool>/256k  40.2µs ±27%  32.3µs ±52%  -19.69%  (p=0.000 n=60+50)
BM_cast<float,bool>/1M     190µs ±21%   196µs ± 6%        ~  (p=0.885 n=59+52)
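
For context on the operation being benchmarked, here is a rough, standalone sketch of one way to turn 16 floats into 16 bool bytes with AVX2 intrinsics. It is not the code in this MR and not Eigen's actual pcast<Packet8f,Packet16b>. The function names are hypothetical, and the choice of pack/permute instructions is an assumption made only for illustration. When AVX2 is not enabled at compile time, the sketch falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar reference (hypothetical helper): C++ float->bool semantics, where any
// nonzero value (including NaN) maps to 1 and +0.0f / -0.0f map to 0.
inline void cast_float_to_bool_scalar(const float* in, std::uint8_t* out,
                                      std::size_t n) {
  assert(in != nullptr && out != nullptr);
  for (std::size_t i = 0; i < n; ++i) out[i] = (in[i] != 0.0f) ? 1 : 0;
}

// Hypothetical helper: converts exactly 16 floats to 16 bytes holding 0 or 1.
inline void cast16_float_to_bool(const float* in, std::uint8_t* out) {
#if defined(__AVX2__)
  const __m256 zero = _mm256_setzero_ps();
  // _CMP_NEQ_UQ is true when a != 0 or when the comparison is unordered (NaN),
  // which matches the scalar `x != 0.0f` test above. True lanes are all-ones.
  const __m256i a = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_loadu_ps(in), zero, _CMP_NEQ_UQ));
  const __m256i b = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_loadu_ps(in + 8), zero, _CMP_NEQ_UQ));
  // Narrow 32-bit masks to 16 bits. The pack works per 128-bit lane, so the
  // 64-bit chunks come out as [a0-3, b0-3, a4-7, b4-7].
  __m256i m16 = _mm256_packs_epi32(a, b);
  // Reorder the 64-bit chunks to [a0-3, a4-7, b0-3, b4-7].
  m16 = _mm256_permute4x64_epi64(m16, 0xD8);
  // Narrow 16-bit masks to 8 bits, then keep only the lowest bit (0xFF -> 1).
  __m128i m8 = _mm_packs_epi16(_mm256_castsi256_si128(m16),
                               _mm256_extracti128_si256(m16, 1));
  m8 = _mm_and_si128(m8, _mm_set1_epi8(1));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), m8);
#else
  cast_float_to_bool_scalar(in, out, 16);
#endif
}
```

Checking the vector path against the scalar loop on inputs that include -0.0f and very small nonzero values is a quick way to confirm the comparison predicate and the lane reordering are right.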
Edited by Rasmus Munk Larsen