more pblend optimizations
Reference issue
What does this implement/fix?
A few more optimizations to pblend. This provides a utility blend_mask_helper for generating a bitmask from an array of bool. It uses explicit loop unrolling to convert the bool array (0/1) to a bitmask (0x00/0xff per byte). The loop unrolling does not appear to be necessary for clang, but it doesn't hurt either, and it goads gcc into auto-vectorizing the loop. The auto-vectorization is useful because we don't have to specialize the helper for every scalar size or account for subtle differences in available instructions (especially AVX vs. AVX2). Some observations:
- even Clang is not smart enough to convert a floating point comparison to an integer comparison when the original data is boolean
- unrolling the loop helps gcc apply auto-vectorization.
- both gcc and clang know how to piece together SSE intrinsics to emulate AVX2 functionality once the loops are unrolled. clang is MUCH better at this, and I can't figure out how to nudge gcc to generate the same assembly
- using the signed integer type produces better assembly in gcc, probably because the integer intrinsics mostly apply to signed types (even though we are only concerned with 0 and 1)
- separating the zero extension (static cast to the integer type) and the negation helps gcc produce better assembly
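To make the observations above concrete, here is a minimal sketch of the blend_mask_helper idea. The name, signature, and unroll factor are illustrative, not the actual code in this MR: it zero-extends each bool into a signed integer (kept as a separate static_cast step), then negates, so 1 becomes all-ones and 0 stays all-zeros, with a 4-way unrolled loop of the kind that nudges gcc into vectorizing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of blend_mask_helper (illustrative only).
// A signed integer type is used so that negating 1 yields all-ones in
// two's complement; the zero extension (static_cast) is deliberately
// kept separate from the negation, which helps gcc produce better
// assembly per the observations above.
template <typename SignedInt, int N>
void blend_mask_helper(const bool* b, SignedInt* mask) {
  static_assert(N % 4 == 0, "sketch assumes N is a multiple of 4");
  for (int i = 0; i < N; i += 4) {
    // Zero extension: bool 0/1 -> integer 0/1.
    SignedInt e0 = static_cast<SignedInt>(b[i + 0]);
    SignedInt e1 = static_cast<SignedInt>(b[i + 1]);
    SignedInt e2 = static_cast<SignedInt>(b[i + 2]);
    SignedInt e3 = static_cast<SignedInt>(b[i + 3]);
    // Negation: 0 -> 0x00...0, 1 -> 0xFF...F.
    mask[i + 0] = -e0;
    mask[i + 1] = -e1;
    mask[i + 2] = -e2;
    mask[i + 3] = -e3;
  }
}
```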
I declare this bike shed to be fully painted.
#include <unsupported/Eigen/CXX11/Tensor>
using Eigen::Tensor;

// Example sizes; the actual benchmark dimensions are not specified here.
const Eigen::Index sizea = 32, sizeb = 32, sizec = 32;
Tensor<bool, 3> selector(sizea, sizeb, sizec);
Tensor<float, 3> mat1(sizea, sizeb, sizec);
Tensor<float, 3> mat2(sizea, sizeb, sizec);
Tensor<float, 3> result(sizea, sizeb, sizec);
selector.setRandom();
mat1.setRandom();
mat2.setRandom();
result.setZero();
const int repeats = 1000;
for (int i = 0; i < repeats; i++)
{
  result += selector.select(mat1, mat2);
}
LLVM on Windows: SSE: -0.13%, AVX: -4.40%, AVX2: -0.43%
With MSVC on Windows (besides run times being 3x longer), the current AVX code is actually faster than AVX2. Could it be that MSVC does not like to mix n' match integer and float code?
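For context on what the mask actually drives: with an all-ones/all-zeros mask, the blend reduces to pure bit selection per lane. A scalar sketch of that per-lane behavior (illustrative only, not the vector kernel in this MR; the function name is made up):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Per-lane behavior of a mask-driven float blend: select a where the
// mask is all-ones, b where it is all-zeros. This is why the bool ->
// 0x00/0xff conversion matters: anything other than all-ones/all-zeros
// would corrupt the float bit pattern.
float blend_lane(float a, float b, int32_t mask) {
  uint32_t ua, ub;
  std::memcpy(&ua, &a, sizeof(ua));  // reinterpret float bits as integer
  std::memcpy(&ub, &b, sizeof(ub));
  uint32_t um = static_cast<uint32_t>(mask);
  uint32_t ur = (ua & um) | (ub & ~um);  // bitwise select
  float r;
  std::memcpy(&r, &ur, sizeof(r));
  return r;
}
```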