Optimize casting for x86_64.

This MR optimizes several pcast operators for the x86 backends, in particular the cast to bool, which will become more important in light of the recent improvements by @chuckyschluz to typed comparisons and the Select operator. Currently this mainly benefits users of the Tensor library, but hopefully we can also find a way to make casting at least partially vectorized in Eigen Core.
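For readers unfamiliar with what a vectorized cast to bool involves, here is a minimal sketch using plain SSE2 intrinsics. It is not the code in this MR and does not use Eigen's pcast machinery. The helper name `cast_float_to_bool_sse2` is hypothetical, and the sketch assumes `sizeof(bool) == 1` with `true` stored as the byte value 1.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics, baseline on x86_64

static_assert(sizeof(bool) == 1, "sketch assumes one-byte bool");

// Hypothetical helper (not part of Eigen): writes out[i] = (in[i] != 0.0f).
inline void cast_float_to_bool_sse2(const float* in, bool* out, std::size_t n) {
  assert((in != nullptr && out != nullptr) || n == 0);
  const __m128 zero = _mm_setzero_ps();
  const __m128i one = _mm_set1_epi8(1);
  std::size_t i = 0;
  // Process 16 floats per iteration so one 128-bit store holds 16 bools.
  for (; i + 16 <= n; i += 16) {
    // cmpneq sets a lane to all ones when the float is nonzero (or NaN).
    __m128i m0 = _mm_castps_si128(_mm_cmpneq_ps(_mm_loadu_ps(in + i), zero));
    __m128i m1 = _mm_castps_si128(_mm_cmpneq_ps(_mm_loadu_ps(in + i + 4), zero));
    __m128i m2 = _mm_castps_si128(_mm_cmpneq_ps(_mm_loadu_ps(in + i + 8), zero));
    __m128i m3 = _mm_castps_si128(_mm_cmpneq_ps(_mm_loadu_ps(in + i + 12), zero));
    // Signed saturating packs keep -1 as -1 and 0 as 0 while narrowing
    // 32 -> 16 -> 8 bits; element order is preserved.
    __m128i w01 = _mm_packs_epi32(m0, m1);
    __m128i w23 = _mm_packs_epi32(m2, m3);
    __m128i bytes = _mm_packs_epi16(w01, w23);
    // Turn 0xFF into 0x01 so every byte is a valid bool value.
    bytes = _mm_and_si128(bytes, one);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), bytes);
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i) out[i] = (in[i] != 0.0f);
}
```

The point of the sketch is that the comparison, the narrowing, and the store each handle many elements at once, whereas a scalar loop converts one element per iteration.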

The speedup of casting is measured on a Skylake machine (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz) using the following benchmark:

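#include <cstdint>
#include <benchmark/benchmark.h>
#include <unsupported/Eigen/CXX11/Tensor>

using Eigen::Tensor;
// Assumption: the original snippet does not show how TensorIndex is defined.
using TensorIndex = Eigen::Index;

template <typename IN, typename OUT>
static void BM_cast(benchmark::State& state) {
  const TensorIndex n = state.range(0);
  const Eigen::array<TensorIndex, 1> sizes{n};
  const Tensor<IN, 1, 0, TensorIndex> A(sizes);
  Tensor<OUT, 1, 0, TensorIndex> B(sizes);

  // Time only the cast-and-assign expression.
  for (auto s : state) {
    benchmark::DoNotOptimize(B = A.template cast<OUT>());
  }
  state.SetItemsProcessed(static_cast<int64_t>(state.iterations()) * state.range(0));
}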

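Benchmark numbers for the affected cast operations. Columns are: benchmark name and size, time before, time after, and relative change (negative means faster; `~` means no statistically significant difference):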

SSE:
BM_cast<float,bool>/8        18.2ns ± 0%  16.6ns ± 0%   -8.65%  (p=0.000 n=52+52)
BM_cast<float,bool>/64       17.9ns ± 1%  15.2ns ± 0%  -15.11%  (p=0.000 n=55+52)
BM_cast<float,bool>/512      67.2ns ± 9%  41.1ns ± 2%  -38.89%  (p=0.000 n=57+45)
BM_cast<float,bool>/4k        465ns ±11%   249ns ± 1%  -46.37%  (p=0.000 n=57+49)
BM_cast<float,bool>/32k      3.84µs ± 1%  2.31µs ± 1%  -39.96%  (p=0.000 n=40+51)
BM_cast<float,bool>/256k     42.5µs ± 6%  36.2µs ± 8%  -14.79%  (p=0.000 n=60+49)
BM_cast<float,bool>/1M        199µs ± 3%   196µs ± 3%   -1.64%  (p=0.000 n=60+60)

AVX:
BM_cast<float,bool>/8        18.3ns ± 1%  14.9ns ± 0%  -18.26%  (p=0.000 n=47+56)
BM_cast<float,bool>/64       17.9ns ± 4%  15.0ns ± 0%  -16.55%  (p=0.000 n=57+51)
BM_cast<float,bool>/512      69.6ns ± 6%  59.3ns ± 1%  -14.85%  (p=0.000 n=60+60)
BM_cast<float,bool>/4k        495ns ± 7%   430ns ± 0%  -13.07%  (p=0.000 n=58+49)
BM_cast<float,bool>/32k      4.25µs ± 4%  3.44µs ± 3%  -19.08%  (p=0.000 n=57+50)
BM_cast<float,bool>/256k     44.7µs ± 2%  41.2µs ± 5%   -7.88%  (p=0.000 n=49+54)
BM_cast<float,bool>/1M        202µs ± 1%   201µs ± 2%   -0.60%  (p=0.000 n=50+60)
BM_cast<double,float>/8      14.1ns ± 0%  12.2ns ± 0%  -13.91%  (p=0.000 n=44+53)
BM_cast<double,float>/64     18.3ns ± 1%  15.9ns ± 0%  -12.99%  (p=0.000 n=57+49)
BM_cast<double,float>/512    55.7ns ± 3%  53.8ns ± 8%   -3.41%  (p=0.000 n=54+56)
BM_cast<double,float>/4k      576ns ± 3%   577ns ± 4%     ~     (p=0.934 n=54+60)
BM_cast<double,float>/32k    4.52µs ± 4%  4.59µs ± 4%   +1.35%  (p=0.000 n=57+60)
BM_cast<double,float>/256k    119µs ± 1%   119µs ± 2%     ~     (p=0.577 n=56+51)
BM_cast<double,float>/1M      479µs ± 2%   478µs ± 2%     ~     (p=0.547 n=45+37)

AVX2:
BM_cast<float,bool>/8        18.3ns ± 0%  17.8ns ± 0%   -2.44%  (p=0.000 n=46+48)
BM_cast<float,bool>/64       17.8ns ± 2%  18.2ns ± 0%   +2.61%  (p=0.000 n=55+48)
BM_cast<float,bool>/512      66.6ns ± 8%  52.0ns ± 9%  -21.84%  (p=0.000 n=53+60)
BM_cast<float,bool>/4k        466ns ± 1%   322ns ± 3%  -30.94%  (p=0.000 n=53+51)
BM_cast<float,bool>/32k      4.18µs ± 4%  2.71µs ± 5%  -35.20%  (p=0.000 n=49+55)
BM_cast<float,bool>/256k     44.8µs ± 6%  39.6µs ± 6%  -11.63%  (p=0.000 n=58+49)
BM_cast<float,bool>/1M        204µs ± 3%   200µs ± 2%   -1.71%  (p=0.000 n=60+59)
BM_cast<double,float>/8      14.1ns ± 1%  12.2ns ± 0%  -13.95%  (p=0.000 n=53+47)
BM_cast<double,float>/64     18.2ns ± 0%  15.9ns ± 0%  -12.85%  (p=0.000 n=54+46)
BM_cast<double,float>/512    56.1ns ± 6%  56.5ns ± 9%     ~     (p=0.281 n=59+60)
BM_cast<double,float>/4k      579ns ± 4%   578ns ± 4%     ~     (p=0.385 n=60+60)
BM_cast<double,float>/32k    4.55µs ± 4%  4.56µs ± 5%     ~     (p=0.421 n=60+56)
BM_cast<double,float>/256k    120µs ± 2%   120µs ± 2%     ~     (p=0.069 n=57+58)
BM_cast<double,float>/1M      482µs ± 3%   480µs ± 2%     ~     (p=0.155 n=38+46)

AVX512:
BM_cast<bool,float>/8        20.3ns ± 4%  14.9ns ± 1%  -26.36%  (p=0.000 n=50+60)
BM_cast<bool,float>/64       21.0ns ± 4%  14.0ns ± 3%  -33.12%  (p=0.000 n=58+55)
BM_cast<bool,float>/512      72.9ns ± 7%  26.8ns ± 9%  -63.27%  (p=0.000 n=60+55)
BM_cast<bool,float>/4k        464ns ± 4%   134ns ± 5%  -71.11%  (p=0.000 n=53+60)
BM_cast<bool,float>/32k      4.09µs ±10%  2.40µs ± 3%  -41.35%  (p=0.000 n=52+52)
BM_cast<bool,float>/256k     46.5µs ± 4%  40.6µs ± 3%  -12.66%  (p=0.000 n=58+50)
BM_cast<bool,float>/1M        224µs ± 6%   201µs ± 2%  -10.47%  (p=0.000 n=55+59)
BM_cast<float,bool>/8        18.0ns ± 0%  15.7ns ± 4%  -12.81%  (p=0.000 n=53+60)
BM_cast<float,bool>/64       18.0ns ± 4%  12.6ns ± 4%  -30.23%  (p=0.000 n=53+60)
BM_cast<float,bool>/512      66.0ns ± 6%  29.5ns ± 8%  -55.30%  (p=0.000 n=57+52)
BM_cast<float,bool>/4k        452ns ± 5%   227ns ± 4%  -49.73%  (p=0.000 n=59+59)
BM_cast<float,bool>/32k      3.85µs ± 6%  1.82µs ± 4%  -52.79%  (p=0.000 n=50+60)
BM_cast<float,bool>/256k     44.3µs ± 8%  33.9µs ± 5%  -23.44%  (p=0.000 n=60+50)
BM_cast<float,bool>/1M        202µs ± 3%   199µs ± 1%   -1.38%  (p=0.000 n=59+58)
BM_cast<double,float>/8      20.7ns ± 2%  13.3ns ± 1%  -36.02%  (p=0.000 n=57+54)
BM_cast<double,float>/64     21.0ns ± 8%  15.1ns ± 4%  -28.17%  (p=0.000 n=60+59)
BM_cast<double,float>/512    72.9ns ± 8%  38.0ns ± 7%  -47.88%  (p=0.000 n=59+48)
BM_cast<double,float>/4k      845ns ± 7%   475ns ± 8%  -43.74%  (p=0.000 n=60+56)
BM_cast<double,float>/32k    6.74µs ± 7%  3.78µs ± 5%  -43.96%  (p=0.000 n=57+47)
BM_cast<double,float>/256k    169µs ± 3%   121µs ± 3%  -28.47%  (p=0.000 n=57+60)
BM_cast<double,float>/1M      681µs ± 5%   486µs ± 5%  -28.64%  (p=0.000 n=41+40)
Edited by Rasmus Munk Larsen
