Faster emulated half comparisons
Reference issue
What does this implement/fix?
IEEE floating point comparisons are essentially signed integer comparisons. Instead of converting fp16 to fp32 for comparisons, we can convert the sign-magnitude representation to two's complement and compare the bit patterns as signed integers. This handles all edge cases automatically (even +0 vs -0) except NaN. For that, we need to check for an ordered comparison. Results look pretty good! The new scalar version is much faster than the old vectorized version!
This MR implements provides the scalar and intel intrinsic for this concept. I expect we could do the same for arm, power, and so on.
We could try it for bfloat16, though the results would be less impressive (perhaps much worse) since the conversion from bfloat16 to f32 is easier.
Additional information
Ubuntu clang version 18.1.3 -DNDEBUG -O3 -mavx2
| Size | Scalar Old [ns] | Scalar New [ns] | Scalar Change [%] | Vector Old [ns] | Vector New [ns] | Vector Change [%] |
|---|---|---|---|---|---|---|
| 1 | 1.45 | 1.02 | -29.66% | 2.68 | 1.02 | -61.94% |
| 2 | 2.68 | 1.85 | -30.97% | 3.8 | 1.86 | -51.05% |
| 4 | 5.17 | 3.34 | -35.40% | 7.93 | 3.48 | -56.12% |
| 8 | 10.2 | 1.49 | -85.39% | 4.54 | 0.84 | -81.52% |
| 16 | 21.2 | 1.55 | -92.69% | 8.98 | 1.30 | -85.52% |
| 32 | 48.9 | 2.92 | -94.03% | 17.7 | 2.23 | -87.40% |
| 64 | 90.1 | 5.45 | -93.95% | 35.2 | 4.45 | -87.36% |
| 128 | 178 | 11.00 | -93.82% | 70.1 | 8.90 | -87.30% |
| 256 | 331 | 21.30 | -93.56% | 140 | 19.00 | -86.43% |
| 512 | 671 | 41.40 | -93.83% | 280 | 36.90 | -86.82% |
| 1024 | 1296 | 82.00 | -93.67% | 558 | 72.50 | -87.01% |
#include <benchmark/benchmark.h>
#include <Eigen/Core>
using namespace Eigen;
using T = half;
using Vec = VectorX<T>;
static void half_scalar(benchmark::State& state) {
Index n = state.range(0);
Vec a(n), b(n);
VectorX<bool> r(n);
a.setRandom();
b.setRandom();
for (auto s : state) {
for(Index i = 0; i < n; i++)
{
r.coeffRef(i) = a.coeff(i) < b.coeff(i);
}
benchmark::DoNotOptimize(r);
}
}
static void half_vector(benchmark::State& state) {
Index n = state.range(0);
Vec a(n), b(n), r(n);
a.setRandom();
b.setRandom();
for (auto s : state) {
r = a.cwiseTypedLess(b);
benchmark::DoNotOptimize(r);
}
}
BENCHMARK(half_scalar)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK(half_vector)->RangeMultiplier(2)->Range(1<<0, 1<<10);
BENCHMARK_MAIN();