Optimize casting for x86_64.
This MR optimizes several pcast operators for the x86 backends, in particular the cast to bool, which will become more important in light of the recent improvements by @chuckyschluz to typed comparison and the Select operator. Currently, this mainly benefits users of the Tensor library, but hopefully we can also find a way to make casting at least partially vectorized in Eigen Core.
The casting speedup was measured on a Skylake machine (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz) using the following benchmark:
    template <typename IN, typename OUT>
    static void BM_cast(benchmark::State& state) {
      int n = state.range(0);
      const Eigen::array<TensorIndex, 1> sizes{n,};
      const Tensor<IN, 1, 0, TensorIndex> A(sizes);
      Tensor<OUT, 1, 0, TensorIndex> B(sizes);
      for (auto s : state) {
        benchmark::DoNotOptimize(B = A.template cast<OUT>());
      }
      state.SetItemsProcessed(static_cast<int64>(state.iterations()) * state.range(0));
    }
Benchmark numbers for the affected cast operations (columns: benchmark/size, time before, time after, relative change):
SSE:
BM_cast<float,bool>/8 18.2ns ± 0% 16.6ns ± 0% -8.65% (p=0.000 n=52+52)
BM_cast<float,bool>/64 17.9ns ± 1% 15.2ns ± 0% -15.11% (p=0.000 n=55+52)
BM_cast<float,bool>/512 67.2ns ± 9% 41.1ns ± 2% -38.89% (p=0.000 n=57+45)
BM_cast<float,bool>/4k 465ns ±11% 249ns ± 1% -46.37% (p=0.000 n=57+49)
BM_cast<float,bool>/32k 3.84µs ± 1% 2.31µs ± 1% -39.96% (p=0.000 n=40+51)
BM_cast<float,bool>/256k 42.5µs ± 6% 36.2µs ± 8% -14.79% (p=0.000 n=60+49)
BM_cast<float,bool>/1M 199µs ± 3% 196µs ± 3% -1.64% (p=0.000 n=60+60)
AVX:
BM_cast<float,bool>/8 18.3ns ± 1% 14.9ns ± 0% -18.26% (p=0.000 n=47+56)
BM_cast<float,bool>/64 17.9ns ± 4% 15.0ns ± 0% -16.55% (p=0.000 n=57+51)
BM_cast<float,bool>/512 69.6ns ± 6% 59.3ns ± 1% -14.85% (p=0.000 n=60+60)
BM_cast<float,bool>/4k 495ns ± 7% 430ns ± 0% -13.07% (p=0.000 n=58+49)
BM_cast<float,bool>/32k 4.25µs ± 4% 3.44µs ± 3% -19.08% (p=0.000 n=57+50)
BM_cast<float,bool>/256k 44.7µs ± 2% 41.2µs ± 5% -7.88% (p=0.000 n=49+54)
BM_cast<float,bool>/1M 202µs ± 1% 201µs ± 2% -0.60% (p=0.000 n=50+60)
BM_cast<double,float>/8 14.1ns ± 0% 12.2ns ± 0% -13.91% (p=0.000 n=44+53)
BM_cast<double,float>/64 18.3ns ± 1% 15.9ns ± 0% -12.99% (p=0.000 n=57+49)
BM_cast<double,float>/512 55.7ns ± 3% 53.8ns ± 8% -3.41% (p=0.000 n=54+56)
BM_cast<double,float>/4k 576ns ± 3% 577ns ± 4% ~ (p=0.934 n=54+60)
BM_cast<double,float>/32k 4.52µs ± 4% 4.59µs ± 4% +1.35% (p=0.000 n=57+60)
BM_cast<double,float>/256k 119µs ± 1% 119µs ± 2% ~ (p=0.577 n=56+51)
BM_cast<double,float>/1M 479µs ± 2% 478µs ± 2% ~ (p=0.547 n=45+37)
AVX2:
BM_cast<float,bool>/8 18.3ns ± 0% 17.8ns ± 0% -2.44% (p=0.000 n=46+48)
BM_cast<float,bool>/64 17.8ns ± 2% 18.2ns ± 0% +2.61% (p=0.000 n=55+48)
BM_cast<float,bool>/512 66.6ns ± 8% 52.0ns ± 9% -21.84% (p=0.000 n=53+60)
BM_cast<float,bool>/4k 466ns ± 1% 322ns ± 3% -30.94% (p=0.000 n=53+51)
BM_cast<float,bool>/32k 4.18µs ± 4% 2.71µs ± 5% -35.20% (p=0.000 n=49+55)
BM_cast<float,bool>/256k 44.8µs ± 6% 39.6µs ± 6% -11.63% (p=0.000 n=58+49)
BM_cast<float,bool>/1M 204µs ± 3% 200µs ± 2% -1.71% (p=0.000 n=60+59)
BM_cast<double,float>/8 14.1ns ± 1% 12.2ns ± 0% -13.95% (p=0.000 n=53+47)
BM_cast<double,float>/64 18.2ns ± 0% 15.9ns ± 0% -12.85% (p=0.000 n=54+46)
BM_cast<double,float>/512 56.1ns ± 6% 56.5ns ± 9% ~ (p=0.281 n=59+60)
BM_cast<double,float>/4k 579ns ± 4% 578ns ± 4% ~ (p=0.385 n=60+60)
BM_cast<double,float>/32k 4.55µs ± 4% 4.56µs ± 5% ~ (p=0.421 n=60+56)
BM_cast<double,float>/256k 120µs ± 2% 120µs ± 2% ~ (p=0.069 n=57+58)
BM_cast<double,float>/1M 482µs ± 3% 480µs ± 2% ~ (p=0.155 n=38+46)
AVX512:
BM_cast<bool,float>/8 20.3ns ± 4% 14.9ns ± 1% -26.36% (p=0.000 n=50+60)
BM_cast<bool,float>/64 21.0ns ± 4% 14.0ns ± 3% -33.12% (p=0.000 n=58+55)
BM_cast<bool,float>/512 72.9ns ± 7% 26.8ns ± 9% -63.27% (p=0.000 n=60+55)
BM_cast<bool,float>/4k 464ns ± 4% 134ns ± 5% -71.11% (p=0.000 n=53+60)
BM_cast<bool,float>/32k 4.09µs ±10% 2.40µs ± 3% -41.35% (p=0.000 n=52+52)
BM_cast<bool,float>/256k 46.5µs ± 4% 40.6µs ± 3% -12.66% (p=0.000 n=58+50)
BM_cast<bool,float>/1M 224µs ± 6% 201µs ± 2% -10.47% (p=0.000 n=55+59)
BM_cast<float,bool>/8 18.0ns ± 0% 15.7ns ± 4% -12.81% (p=0.000 n=53+60)
BM_cast<float,bool>/64 18.0ns ± 4% 12.6ns ± 4% -30.23% (p=0.000 n=53+60)
BM_cast<float,bool>/512 66.0ns ± 6% 29.5ns ± 8% -55.30% (p=0.000 n=57+52)
BM_cast<float,bool>/4k 452ns ± 5% 227ns ± 4% -49.73% (p=0.000 n=59+59)
BM_cast<float,bool>/32k 3.85µs ± 6% 1.82µs ± 4% -52.79% (p=0.000 n=50+60)
BM_cast<float,bool>/256k 44.3µs ± 8% 33.9µs ± 5% -23.44% (p=0.000 n=60+50)
BM_cast<float,bool>/1M 202µs ± 3% 199µs ± 1% -1.38% (p=0.000 n=59+58)
BM_cast<double,float>/8 20.7ns ± 2% 13.3ns ± 1% -36.02% (p=0.000 n=57+54)
BM_cast<double,float>/64 21.0ns ± 8% 15.1ns ± 4% -28.17% (p=0.000 n=60+59)
BM_cast<double,float>/512 72.9ns ± 8% 38.0ns ± 7% -47.88% (p=0.000 n=59+48)
BM_cast<double,float>/4k 845ns ± 7% 475ns ± 8% -43.74% (p=0.000 n=60+56)
BM_cast<double,float>/32k 6.74µs ± 7% 3.78µs ± 5% -43.96% (p=0.000 n=57+47)
BM_cast<double,float>/256k 169µs ± 3% 121µs ± 3% -28.47% (p=0.000 n=57+60)
BM_cast<double,float>/1M 681µs ± 5% 486µs ± 5% -28.64% (p=0.000 n=41+40)
Edited by Rasmus Munk Larsen