masked load/store framework

Reference issue

What does this implement/fix?

This MR adds a framework for using masked loads/stores (i.e., packet segments: a contiguous portion of a packet) in Eigen's vectorized assignment loops. This lets us fully vectorize odd-sized arrays by using a packet segment op for the tail end of an array (or for the head in the case of unaligned arrays, where we tend to use an align-at-all-costs evaluation strategy).
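
To illustrate what a packet segment op boils down to at the instruction level, here is a minimal standalone sketch (my own example, not this MR's API) that handles a tail of fewer than 8 floats with a single AVX masked load/store instead of a scalar loop:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Illustrative only: process a tail of count < 8 floats with one masked
// load/store. The mask marks the first `count` lanes as active.
void scale_tail(float* dst, const float* src, std::ptrdiff_t start, std::ptrdiff_t count) {
  alignas(32) std::int32_t mask_bits[8];
  for (int i = 0; i < 8; ++i) mask_bits[i] = (i < count) ? -1 : 0;  // -1 = active lane
  const __m256i mask = _mm256_load_si256(reinterpret_cast<const __m256i*>(mask_bits));
  const __m256 x = _mm256_maskload_ps(src + start, mask);   // inactive lanes are never read
  const __m256 y = _mm256_mul_ps(x, _mm256_set1_ps(2.0f));  // stand-in for the expression kernel
  _mm256_maskstore_ps(dst + start, mask, y);                // only active lanes are written
}
```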

To keep this backwards compatible, packet segments are opted into on a per-type basis. I only implemented this for AVX/AVX2, as I could not test other architectures. Currently this supports Packet4(8)f, Packet4(8)i, Packet2(4)d, Packet2(4)l/u, Packet4(2)cf, and Packet2cd (Packet1cd cannot benefit, as the entire array is always vectorized). Packets with scalar types smaller than 32 bits are not supported. The few relevant intrinsics available in SSE are not worthwhile, as they are slow and segfault if we attempt to access unallocated memory.
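
Conceptually, the per-type opt-in amounts to a trait specialization along these lines (the trait name below is hypothetical and only illustrates the idea; it is not necessarily the identifier this MR introduces):

```cpp
#include <Eigen/Core>
#include <type_traits>

// Hypothetical illustration of a per-packet opt-in: packet types that provide
// segment (masked) loads/stores specialize the trait; everything else keeps
// the existing scalar head/tail handling.
template <typename Packet>
struct has_packet_segment : std::false_type {};

#ifdef EIGEN_VECTORIZE_AVX
template <>
struct has_packet_segment<Eigen::internal::Packet8f> : std::true_type {};
#endif
```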

To support this, I fixed an old "bug" for arrays that are not aligned on the scalar type, which only tends to come up for complex types since sizeof(T) > alignof(T). If we aren't guaranteed scalar alignment at compile time, we now simply use unaligned instructions. This guarantees that the head/tail loops are each smaller than a packet. Previously, we reverted to a scalar loop for slice vectorization if the array was not aligned on the scalar type at runtime. The new behavior should be a lot better.
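
For reference, the loop structure this enables looks roughly as follows. This is a simplified sketch with my own names (packet_op / segment_op stand in for the evaluator's full-packet and segment assignments), not the actual evaluator code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Rough sketch (not Eigen's actual evaluator code) of a fully vectorized linear
// assignment: an optional head segment, a body of whole packets, then a tail
// segment, each segment being at most PacketSize - 1 elements.
template <int PacketSize, typename Scalar, typename PacketOp, typename SegmentOp>
void assign_linear(Scalar* dst, std::ptrdiff_t size, PacketOp packet_op, SegmentOp segment_op) {
  constexpr std::uintptr_t Alignment = PacketSize * sizeof(Scalar);
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(dst);
  // Elements to skip until dst is packet-aligned. If dst were not even
  // scalar-aligned, the MR just uses unaligned packet ops throughout; this
  // sketch only shows the scalar-aligned path.
  const std::ptrdiff_t to_align =
      static_cast<std::ptrdiff_t>(((Alignment - addr % Alignment) % Alignment) / sizeof(Scalar));
  const std::ptrdiff_t aligned_start = std::min(size, to_align);
  const std::ptrdiff_t aligned_end =
      aligned_start + ((size - aligned_start) / PacketSize) * PacketSize;

  if (aligned_start > 0) segment_op(0, aligned_start);  // head: one masked op, < PacketSize elements
  for (std::ptrdiff_t i = aligned_start; i < aligned_end; i += PacketSize)
    packet_op(i);                                       // body: full aligned packets
  if (aligned_end < size)
    segment_op(aligned_end, size - aligned_end);        // tail: one masked op, < PacketSize elements
}
```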

Additional information

I tested four scenarios where the SIMD type is Packet8f (concrete sizes are worked out just below the list):

  1. control: size = n * PacketSize
  2. +1: size = n * PacketSize + 1
  3. +half: size = n * PacketSize + PacketSize / 2
  4. -1: size = n * PacketSize + PacketSize - 1
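
For example, with Packet8f (PacketSize = 8) and n = 8, the four scenarios correspond to sizes 64, 65, 68, and 71. Only the control size is an exact multiple of the packet size, so the other three exercise the packet segment path for their tails.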

What's really interesting is that we see improvements even in the "control" scenario: even when the masked load/store is not used, it's faster! The gains attenuate as the size grows toward 1024, and there are no regressions except for small (roughly 1-2%) ones in the +1 case. There is almost no difference between AVX and AVX2.

Compiler: Ubuntu clang version 18.1.3, flags: -DNDEBUG -O3 -mavx2

| Benchmark                               | Time (ns) | Change | CPU (ns) | Change | Iterations | Change |
|-----------------------------------------|-----------|--------|----------|--------|------------|--------|
| test_segment_linvec_control<float>/1    | 1.11      | -66.5% | 1.11     | -66.5% | 624444841  | 202.1% |
| test_segment_linvec_control<float>/8    | 1.61      | -31.1% | 1.61     | -31.1% | 431603479  | 45.7%  |
| test_segment_linvec_control<float>/64   | 8.93      | -0.4%  | 8.93     | -0.4%  | 78158246   | 0.7%   |
| test_segment_linvec_control<float>/512  | 71.2      | 0.0%   | 71.2     | 0.0%   | 9829538    | -0.6%  |
| test_segment_linvec_control<float>/1024 | 142       | 0.0%   | 142      | 0.0%   | 4918170    | -0.2%  |
| test_segment_linvec_1<float>/1          | 1.58      | -28.5% | 1.58     | -28.5% | 443047539  | 41.7%  |
| test_segment_linvec_1<float>/8          | 2.22      | 1.8%   | 2.22     | 1.8%   | 313482343  | -1.3%  |
| test_segment_linvec_1<float>/64         | 10        | 1.0%   | 10       | 1.0%   | 69921081   | -1.8%  |
| test_segment_linvec_1<float>/512        | 72.4      | 1.8%   | 72.4     | 1.8%   | 9684202    | -1.9%  |
| test_segment_linvec_1<float>/1024       | 143       | 2.1%   | 143      | 2.1%   | 4881230    | -1.9%  |
| test_segment_linvec_half<float>/1       | 2.23      | -48.9% | 2.23     | -48.9% | 314013507  | 95.8%  |
| test_segment_linvec_half<float>/8       | 5.56      | -59.2% | 5.56     | -59.2% | 125809560  | 145.6% |
| test_segment_linvec_half<float>/64      | 13.4      | -23.9% | 13.4     | -23.9% | 52423334   | 30.9%  |
| test_segment_linvec_half<float>/512     | 75.7      | -2.6%  | 75.7     | -2.6%  | 9219309    | 3.1%   |
| test_segment_linvec_half<float>/1024    | 147       | -0.7%  | 147      | -0.7%  | 4767863    | 0.4%   |
| test_segment_linvec_m1<float>/1         | 3.91      | -70.8% | 3.91     | -71.1% | 179881626  | 243.4% |
| test_segment_linvec_m1<float>/8         | 8.9       | -74.5% | 8.89     | -74.5% | 78429901   | 293.6% |
| test_segment_linvec_m1<float>/64        | 16.7      | -38.9% | 16.7     | -38.9% | 41900570   | 63.8%  |
| test_segment_linvec_m1<float>/512       | 79.1      | -6.7%  | 79.1     | -6.8%  | 8867746    | 7.2%   |
| test_segment_linvec_m1<float>/1024      | 151       | -3.3%  | 151      | -3.3%  | 4581060    | 4.6%   |

Benchmark code

```cpp
#include <benchmark/benchmark.h>
#include <Eigen/Core>
using namespace Eigen;

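// Control: the size is rounded down to an exact multiple of PacketSize,
// so no head/tail segment is needed.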
template<typename T>
static void test_segment_linvec_control(benchmark::State& state) {
  using Mat = ArrayX<T>;
  constexpr Index PacketSize = internal::packet_traits<T>::size;
  Index n = numext::round_down(state.range(0), PacketSize);
  Mat A(n), B(n);
  for(Index i = 0; i < n; i++) A(i) = internal::random<T>(T(1.0), T(10.0));
  for (auto s : state) {
    B = (A + A).cwiseAbs2().cwiseSqrt();
    benchmark::DoNotOptimize(A);
    benchmark::DoNotOptimize(B);
  }
}

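// +1: one element past a packet boundary, leaving a one-element tail
// (assuming an aligned start).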
template<typename T>
static void test_segment_linvec_1(benchmark::State& state) {
  using Mat = ArrayX<T>;
  constexpr Index PacketSize = internal::packet_traits<T>::size;
  Index n = numext::round_down(state.range(0), PacketSize) + 1;
  Mat A(n), B(n);
  for(Index i = 0; i < n; i++) A(i) = internal::random<T>(T(1.0), T(10.0));
  for (auto s : state) {
    B = (A + A).cwiseAbs2().cwiseSqrt();
    benchmark::DoNotOptimize(A);
    benchmark::DoNotOptimize(B);
  }
}

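// +half: half a packet past a packet boundary (a 4-element tail for Packet8f).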
template<typename T>
static void test_segment_linvec_half(benchmark::State& state) {
  using Mat = ArrayX<T>;
  constexpr Index PacketSize = internal::packet_traits<T>::size;
  Index n = numext::round_down(state.range(0), PacketSize) + PacketSize / 2;
  Mat A(n), B(n);
  for(Index i = 0; i < n; i++) A(i) = internal::random<T>(T(1.0), T(10.0));
  for (auto s : state) {
    B = (A + A).cwiseAbs2().cwiseSqrt();
    benchmark::DoNotOptimize(A);
    benchmark::DoNotOptimize(B);
  }
}

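// -1: one element short of the next packet boundary (a 7-element tail for Packet8f).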
template<typename T>
static void test_segment_linvec_m1(benchmark::State& state) {
  using Mat = ArrayX<T>;
  constexpr Index PacketSize = internal::packet_traits<T>::size;
  Index n = numext::round_down(state.range(0), PacketSize) + PacketSize - 1;
  Mat A(n), B(n);
  for(Index i = 0; i < n; i++) A(i) = internal::random<T>(T(1.0), T(10.0));
  for (auto s : state) {
    B = (A + A).cwiseAbs2().cwiseSqrt();
    benchmark::DoNotOptimize(A);
    benchmark::DoNotOptimize(B);
  }
}

BENCHMARK(test_segment_linvec_control<float>)->Range(1<<0, 1<<10);
BENCHMARK(test_segment_linvec_1<float>)->Range(1<<0, 1<<10);
BENCHMARK(test_segment_linvec_half<float>)->Range(1<<0, 1<<10);
BENCHMARK(test_segment_linvec_m1<float>)->Range(1<<0, 1<<10);
BENCHMARK_MAIN();
```