Faster emulated half comparisons

Reference issue

What does this implement/fix?

IEEE floating point comparisons are essentially signed integer comparisons. Instead of converting fp16 to fp32 for comparisons, we can convert the sign-magnitude representation to two's complement and compare the bit patterns as signed integers. This handles all edge cases automatically (even +0 vs -0) except NaN. For that, we need to check for an ordered comparison. Results look pretty good! The new scalar version is much faster than the old vectorized version!

This MR implements provides the scalar and intel intrinsic for this concept. I expect we could do the same for arm, power, and so on.

We could try it for bfloat16, though the results would be less impressive (perhaps much worse) since the conversion from bfloat16 to f32 is easier.

Additional information

Ubuntu clang version 18.1.3 -DNDEBUG -O3 -mavx2

Size Scalar Old [ns] Scalar New [ns] Scalar Change [%] Vector Old [ns] Vector New [ns] Vector Change [%]
1 1.45 1.02 -29.66% 2.68 1.02 -61.94%
2 2.68 1.85 -30.97% 3.8 1.86 -51.05%
4 5.17 3.34 -35.40% 7.93 3.48 -56.12%
8 10.2 1.49 -85.39% 4.54 0.84 -81.52%
16 21.2 1.55 -92.69% 8.98 1.30 -85.52%
32 48.9 2.92 -94.03% 17.7 2.23 -87.40%
64 90.1 5.45 -93.95% 35.2 4.45 -87.36%
128 178 11.00 -93.82% 70.1 8.90 -87.30%
256 331 21.30 -93.56% 140 19.00 -86.43%
512 671 41.40 -93.83% 280 36.90 -86.82%
1024 1296 82.00 -93.67% 558 72.50 -87.01%
#include <benchmark/benchmark.h>
#include <Eigen/Core>
using namespace Eigen;

using T = half;
using Vec = VectorX<T>;

static void half_scalar(benchmark::State& state) {
  Index n = state.range(0);
  Vec a(n), b(n);
  VectorX<bool> r(n);
  a.setRandom();
  b.setRandom();
  for (auto s : state) {
    for(Index i = 0; i < n; i++)
    {
      r.coeffRef(i) = a.coeff(i) < b.coeff(i);
    }
    benchmark::DoNotOptimize(r);
  }
}

static void half_vector(benchmark::State& state) {
  Index n = state.range(0);
  Vec a(n), b(n), r(n);
  a.setRandom();
  b.setRandom();
  for (auto s : state) {
    r = a.cwiseTypedLess(b);
    benchmark::DoNotOptimize(r);
  }
}

BENCHMARK(half_scalar)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK(half_vector)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK_MAIN();
Edited by Rasmus Munk Larsen

Merge request reports

Loading