Partially Vectorize Cast

Reference issue

What does this implement/fix?

Specialize the evaluator of scalar_cast_op to handle different input and output packet types, multiple input packets per output packet, etc. This works as long as type_casting_traits is correctly defined and pcast returns a single packet.

Assuming type_casting_traits::VectorizedCast is 1 and the pcast is defined, there are 3 scenarios:

  1. The destination packet is the same size as or larger than the default source packet: fully vectorized
  2. The destination packet is smaller than the default source packet: a suitable half (or quarter) source packet is selected so that scenario 1 applies
  3. The destination packet is smaller than the default source packet and no suitable half packet is available: a run-time check verifies that the packet load would not result in an out-of-bounds data access. If it would, the packet op is synthesized from scalar operations instead.
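The run-time check in scenario 3 can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this merge request: `Packet`, `load_packet`, and `cast_at` are hypothetical stand-ins for SIMD packets and the evaluator's packet access, and the sketch assumes the caller guarantees that `index + DstN <= size`.

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for a SIMD packet holding N coefficients.
template <typename T, std::size_t N>
struct Packet {
  T v[N];
};

// Stand-in for an unaligned packet load: always reads exactly N coefficients.
template <typename T, std::size_t N>
Packet<T, N> load_packet(const T* p) {
  Packet<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r.v[i] = p[i];
  return r;
}

// Produce DstN cast coefficients starting at `index`.
// The only available source packet holds SrcN > DstN coefficients, so a
// packet load reads SrcN - DstN coefficients more than are actually needed.
// `used_packet_load` reports which path was taken (for illustration only).
template <typename Dst, typename Src, std::size_t SrcN, std::size_t DstN>
Packet<Dst, DstN> cast_at(const Src* src, std::size_t size, std::size_t index,
                          bool& used_packet_load) {
  static_assert(DstN < SrcN, "sketch covers only the smaller-destination case");
  Packet<Dst, DstN> out;
  if (index + SrcN <= size) {
    // The full source packet lies inside the buffer: load it and cast the
    // leading DstN coefficients.
    const Packet<Src, SrcN> in = load_packet<Src, SrcN>(src + index);
    for (std::size_t i = 0; i < DstN; ++i) out.v[i] = static_cast<Dst>(in.v[i]);
    used_packet_load = true;
  } else {
    // A packet load would read past the end: build the result from scalars.
    for (std::size_t i = 0; i < DstN; ++i)
      out.v[i] = static_cast<Dst>(src[index + i]);
    used_packet_load = false;
  }
  return out;
}

}  // namespace sketch
```

With a 6-element source, `SrcN = 4`, and `DstN = 2`, a request at index 0 or 2 takes the packet path, while a request at index 4 falls back to scalars because a 4-wide load would read two coefficients past the end.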

Benchmarks:

For a pure cast, dst = src.cast<DstType>(), I saw very little difference in performance compared to the scalar path. This is understandable, as there is too little work being done to justify the overhead of the loads and stores. However, I also saw no decrease in performance, which is good.

The story is different for a more complex expression like dst = src.abs2().sqrt().log().cast<DstType>();

AVX (double->float): -55%

From this example, we see that the meat and potatoes of the expression (the arithmetic operations) vastly outweigh the cost of the cast.
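To make the per-element work concrete, here is a plain scalar reference for what the complex expression computes, assuming real `double` input and `float` output as in the AVX measurement. The function name `reference_cast_chain` is hypothetical and the loop is only an illustration of the arithmetic, not the benchmark code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for dst = src.abs2().sqrt().log().cast<float>()
// on real double input. Each element needs a multiply, a square root,
// and a logarithm before the single cast at the end.
std::vector<float> reference_cast_chain(const std::vector<double>& src) {
  std::vector<float> dst(src.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    const double abs2 = src[i] * src[i];       // abs2(): squared magnitude
    const double root = std::sqrt(abs2);       // sqrt()
    const double logv = std::log(root);        // log()
    dst[i] = static_cast<float>(logv);         // cast<float>()
  }
  return dst;
}
```

Unless the arithmetic stages before the cast run on packets, they fall back to this kind of one-element-at-a-time loop, which is where most of the time goes.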

Additional information

Edited by Charles Schlosser
