Faster emulated half comparisons (!1910) · Merge requests · libeigen / eigen

Reference issue

What does this implement/fix?

IEEE floating point comparisons are essentially signed integer comparisons. Instead of converting fp16 to fp32 for comparisons, we can convert the sign-magnitude representation to two's complement and compare the bit patterns as signed integers. This handles all edge cases automatically (even +0 vs -0) except NaN. For that, we need to check for an ordered comparison. Results look pretty good! The new scalar version is much faster than the old vectorized version!

This MR implements provides the scalar and intel intrinsic for this concept. I expect we could do the same for arm, power, and so on.

We could try it for bfloat16, though the results would be less impressive (perhaps much worse) since the conversion from bfloat16 to f32 is easier.

Additional information

Ubuntu clang version 18.1.3 -DNDEBUG -O3 -mavx2

Size	Scalar Old [ns]	Scalar New [ns]	Scalar Change [%]	Vector Old [ns]	Vector New [ns]	Vector Change [%]
1	1.45	1.02	-29.66%	2.68	1.02	-61.94%
2	2.68	1.85	-30.97%	3.8	1.86	-51.05%
4	5.17	3.34	-35.40%	7.93	3.48	-56.12%
8	10.2	1.49	-85.39%	4.54	0.84	-81.52%
16	21.2	1.55	-92.69%	8.98	1.30	-85.52%
32	48.9	2.92	-94.03%	17.7	2.23	-87.40%
64	90.1	5.45	-93.95%	35.2	4.45	-87.36%
128	178	11.00	-93.82%	70.1	8.90	-87.30%
256	331	21.30	-93.56%	140	19.00	-86.43%
512	671	41.40	-93.83%	280	36.90	-86.82%
1024	1296	82.00	-93.67%	558	72.50	-87.01%

#include <benchmark/benchmark.h>
#include <Eigen/Core>
using namespace Eigen;

using T = half;
using Vec = VectorX<T>;

static void half_scalar(benchmark::State& state) {
  Index n = state.range(0);
  Vec a(n), b(n);
  VectorX<bool> r(n);
  a.setRandom();
  b.setRandom();
  for (auto s : state) {
    for(Index i = 0; i < n; i++)
    {
      r.coeffRef(i) = a.coeff(i) < b.coeff(i);
    }
    benchmark::DoNotOptimize(r);
  }
}

static void half_vector(benchmark::State& state) {
  Index n = state.range(0);
  Vec a(n), b(n), r(n);
  a.setRandom();
  b.setRandom();
  for (auto s : state) {
    r = a.cwiseTypedLess(b);
    benchmark::DoNotOptimize(r);
  }
}

BENCHMARK(half_scalar)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK(half_vector)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK_MAIN();

Edited Jun 16, 2025 by Rasmus Munk Larsen

Faster emulated half comparisons

Reference issue

What does this implement/fix?

Additional information

Merge request reports