Vectorize cast
Reference issue
What does this implement/fix?
Specialize the evaluator of scalar_cast_op to handle different input and output packet types, multiple input packets per output packet, etc. Works as long as type_casting_traits is correctly defined and pcast returns a single packet.
If the new packet size is smaller than the old packet size, we load the same data multiple times (in fact, we re-evaluate the entire expression on it). For example:
template<> EIGEN_STRONG_INLINE Packet2d pcast<Packet4f, Packet2d>(const Packet4f& a) {
// Simply discard the second half of the input
return _mm_cvtps_pd(a);
}
We would load elements {0,1,2,3}, increment the assignment loop by two, load elements {2,3,4,5}, and so on. We also reduce the alignment by half, and probably perform unaligned loads. This isn't great, but it is probably preferable to the scalar path. I recommend investigating a way to optimize casts to a smaller packet (fewer elements per packet).
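To make the overlapping-load pattern concrete, here is a minimal sketch in plain C++ that models the loop described above. It does not use Eigen: `SrcPacket`, `DstPacket`, `load_src`, `cast_low_half`, and `cast_loop` are hypothetical stand-ins, and the loop bounds and scalar tail are assumptions for illustration, not the actual evaluator code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a 4-wide float packet and a 2-wide double packet.
using SrcPacket = std::array<float, 4>;
using DstPacket = std::array<double, 2>;

// Models an (unaligned) packet load of 4 floats starting at element i.
SrcPacket load_src(const std::vector<float>& src, std::size_t i) {
  return {src[i], src[i + 1], src[i + 2], src[i + 3]};
}

// Models a cast that keeps the low half of the input and discards the rest.
DstPacket cast_low_half(const SrcPacket& a) {
  return {static_cast<double>(a[0]), static_cast<double>(a[1])};
}

// Runs the modeled assignment loop and returns the number of source
// elements that were loaded in total (including the scalar tail).
std::size_t cast_loop(const std::vector<float>& src, std::vector<double>& dst) {
  assert(dst.size() == src.size());
  const std::size_t n = src.size();
  std::size_t loaded = 0;
  std::size_t i = 0;
  // Packet part: load 4 source elements, but advance by only 2.
  for (; i + 4 <= n; i += 2) {
    const DstPacket p = cast_low_half(load_src(src, i));
    dst[i] = p[0];
    dst[i + 1] = p[1];
    loaded += 4;
  }
  // Scalar tail for the elements a full source packet no longer covers.
  for (; i < n; ++i) {
    dst[i] = static_cast<double>(src[i]);
    loaded += 1;
  }
  return loaded;
}
```

In this model, casting 8 elements loads 14 source elements: three packet loads at offsets 0, 2, and 4, plus two scalar loads. That is close to twice the work of a loop that consumed each source packet fully.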
Benchmarks:
For a pure cast, dst = src.cast<DstType>(), I saw very little difference in performance from the scalar path, which is understandable: there is very little work being done to justify the overhead of the loads and stores. However, I also saw no decrease in performance, which is good.
The story is different for a more complex expression like dst = src.abs2().sqrt().log().cast<DstType>();
AVX (double->float): -55%
AVX (float->double): -59%
From this example, we see that the meat and potatoes of the expression, the arithmetic operations, vastly outweigh the cost of the cast, even if we effectively evaluate the expression twice, as is the case for float->double. The similarity of the numbers is also explainable. In the double->float case, we invoke 1 Packet4d op to increment the loop by 4 elements. In float->double, we invoke 2 Packet8f ops to increment the loop by 8 elements. Overall, the number of packet ops per element advanced is the same.
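The op-count argument can be checked with a small piece of arithmetic. The sketch below uses a hypothetical helper, `source_packet_ops`, and assumes that in both directions the loop advances by 4 elements per step and evaluates one source packet per step (one Packet4d for double->float, one Packet8f for float->double). It is a model of the reasoning above, not a measurement.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: number of source-packet evaluations needed to advance
// the assignment loop over n elements, when each loop step writes dst_step
// elements and evaluates src_packets_per_step source packets.
constexpr std::size_t source_packet_ops(std::size_t n, std::size_t dst_step,
                                        std::size_t src_packets_per_step) {
  return (n / dst_step) * src_packets_per_step;
}

// double->float: one Packet4d op per 4 elements written (assumed step of 4).
constexpr std::size_t ops_double_to_float(std::size_t n) {
  return source_packet_ops(n, 4, 1);
}

// float->double: one Packet8f op per 4 elements written (assumed step of 4),
// with the upper half of each Packet8f result going unused.
constexpr std::size_t ops_float_to_double(std::size_t n) {
  return source_packet_ops(n, 4, 1);
}

static_assert(ops_double_to_float(8) == ops_float_to_double(8),
              "both directions need the same number of packet ops");
```

Under these assumptions both directions need 2 packet ops per 8 elements, which is consistent with the two benchmark numbers being close to each other.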