Vectorize allFinite()
For example, with AVX this speeds up allFinite() by roughly 2.8x (about 1.7G items/s before, about 4.8G items/s after).
before:
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                 Time(ns)   CPU(ns)  Iterations
----------------------------------------------------------------
BM_allFinite/8_mean           38.9      38.9    17706244  1.645G items/s
BM_allFinite/64_mean          2318      2318      299977  1.767G items/s
BM_allFinite/512_mean       151051    151039        4619  1.737G items/s
BM_allFinite/1k_mean        611419    612018        1099  1.714G items/s
after:
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                 Time(ns)   CPU(ns)  Iterations
----------------------------------------------------------------
BM_allFinite/8_mean           13.6      13.6    51560404  4.708G items/s
BM_allFinite/64_mean           843       843      830627  4.861G items/s
BM_allFinite/512_mean        54340     54352       12000  4.823G items/s
BM_allFinite/1k_mean        221361    221362        3168  4.737G items/s
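
To illustrate why a finiteness check benefits from vectorization, here is a minimal standalone sketch. It is not the code in this change and does not use Eigen; `all_finite_sketch` is a hypothetical helper, and it assumes IEEE-754 binary32 `float`. Each element is tested with the same bit operations and the results are OR-ed into one flag, so the loop body has no data-dependent branch and can be processed several elements at a time by SIMD instructions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE-754 floats");

// Hypothetical helper for illustration only (not Eigen's implementation).
// An IEEE-754 float is Inf or NaN exactly when all exponent bits are set.
inline bool all_finite_sketch(const float* data, std::size_t n) {
  constexpr std::uint32_t kExponentMask = 0x7f800000u;
  std::uint32_t any_non_finite = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::uint32_t bits;
    std::memcpy(&bits, &data[i], sizeof(bits));  // well-defined bit copy
    // Accumulate instead of returning early, so there is no
    // data-dependent branch inside the loop.
    any_non_finite |=
        static_cast<std::uint32_t>((bits & kExponentMask) == kExponentMask);
  }
  return any_non_finite == 0;
}
```

The trade-off in this sketch is that it always scans the whole input, even when a non-finite value appears early; whether the compiler actually emits SIMD code for it depends on the compiler and flags.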
Edited by Rasmus Munk Larsen